Comparing "parallel passages" in Digital Archives

The term “parallel passage” refers to identical or approximate text patterns of variable length that can be regarded as semantically equivalent. Parallel passages are alternative surface representations that either share identical wording, as in reported speech and direct quotations, or differ slightly in grammatical structure or vocabulary choice as a result of paraphrasing. On the one hand, differences in vocabulary choice may result from synonymy, or from hyperonymy, where a more general or higher-level concept has been selected [Madnani and Dorr, 2010]. On the other hand, paraphrasing by the author may provide evidence of text reuse, or intertextuality [Fairclough, 1992], where the author has summarised the main concepts, or meaning, encoded by one or more preceding texts. Further differences between passages may arise from a shift in authorship, dialect, the natural evolution of language over time [Büchler, 2010], and errors introduced by optical character recognition (OCR) during digitisation. Comparing equivalent or similar shared text patterns in corpora stored in digital archives has become increasingly challenging and time consuming: at the current scale of digital text data, comparing shared text patterns across multiple documents is practically impossible to do manually. Identifying parallel passages, such as those exemplified by paraphrases, also supports a range of natural language tasks, including text generation, information retrieval and extraction, and summarisation. This paper presents an overview of the text mining tools developed to compare parallel passages. The tools were deployed in a system known as Samtla (Search And Mining Tools for Language Archives), which was developed to support research on historic and cultural heritage collections of documents stored in digital archives.
The paper is organised as follows. In Section 2, we review the related work. Section 3 describes the corpora used as test cases to explore the results generated by our proposed approach. Section 4 describes the model used as a basis for extracting and scoring the contents of documents according to their shared text patterns. In Section 5, we describe the approach used for identifying related documents according to our proposed model, where we measure the similarity of pairs of documents based on their character-level n-gram probability distributions. Section 6 presents an approach for visualising local similarities between related documents in the form of variable-length parallel passages extracted from the document content. We briefly discuss the motivation behind the user interface in Section 7, and in Section 8 some of the language- and corpus-dependent issues that the document comparison tool addresses, to demonstrate the flexibility of the approach across domains, languages, authors, and time periods. We conclude in Section 9 with a summary of the work and directions for future research and development.
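As a rough illustration of the document comparison step described above, the following Python sketch builds character-level n-gram probability distributions for two texts and compares them. The abstract does not name the similarity measure, so the use of Jensen-Shannon divergence here is an assumption, as are the function names, the trigram default, and the sample strings; this is not the Samtla implementation.

```python
import math
from collections import Counter


def char_ngram_distribution(text, n=3):
    """Return the relative frequency of each character n-gram in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions.

    Assumed measure, chosen for illustration only. For non-empty inputs,
    0.0 means identical distributions and 1.0 means no shared n-grams.
    """
    def kl_to_mixture(a, b):
        # KL divergence from a to the midpoint mixture of a and b.
        total = 0.0
        for gram, pa in a.items():
            m = 0.5 * (pa + b.get(gram, 0.0))
            total += pa * math.log2(pa / m)
        return total

    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)


# Hypothetical sample texts: two spelling variants and one unrelated string.
a = char_ngram_distribution("in the beginning god created the heaven and the earth")
b = char_ngram_distribution("in the begynnynge god created heaven and erth")
c = char_ngram_distribution("xyzzy quux plugh")

# A lower divergence suggests a closer match between documents.
print(jensen_shannon_divergence(a, b) < jensen_shannon_divergence(a, c))
```

Working at the character level rather than the word level means that spelling variants and OCR errors still share most of their n-grams, which is one plausible reason for preferring this representation with historic texts.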

archives, Data mining, Humanities

10.1108/JD-10-2018-0175

Levene, Dan

Levene, Mark

Martyn, Harris

Zhang, Dell


Levene, Dan, Levene, Mark, Martyn, Harris and Zhang, Dell (2019) Comparing "parallel passages" in Digital Archives. Journal of Documentation. (doi:10.1108/JD-10-2018-0175).

Record type: Article


This record has no associated files available for download.

More information

Accepted/In Press date: 17 July 2019

e-pub ahead of print date: 2 September 2019

Keywords: archives, Data mining, Humanities


Identifiers

Local EPrints ID: 436352

URI: http://eprints.soton.ac.uk/id/eprint/436352

DOI: 10.1108/JD-10-2018-0175

PURE UUID: 94d9b080-8edc-436e-830e-1dca3c4fa91b

Catalogue record

Date deposited: 06 Dec 2019 17:30

Last modified: 06 Mar 2026 06:00


Contributors

Author: Dan Levene

Author: Mark Levene

Author: Harris Martyn

Author: Dell Zhang
