Levene, Dan, Levene, Mark, Martyn, Harris and Zhang, Dell (2019) Comparing "parallel passages" in Digital Archives. Journal of Documentation. (doi:10.1108/JD-10-2018-0175).
Abstract
The term “parallel passage” refers to identical or approximate text patterns of variable length that can be regarded as semantically equivalent. Parallel passages are alternative surface representations that either exhibit identical wording, as in reported speech and direct quotations, or show small variations in grammatical structure or vocabulary choice as a result of paraphrasing. On the one hand, differences in vocabulary choice may result from synonymy, or from hyperonymy, where a more general or higher-level concept has been selected [Madnani and Dorr, 2010]. On the other hand, paraphrasing by the author may provide evidence of text reuse, or intertextuality [Fairclough, 1992], where the author has summarised the main concepts, or meaning, encoded by one or more preceding texts. Further differences between passages may arise from a shift in authorship or dialect, the natural evolution of language over time [Büchler, 2010], and errors introduced by optical character recognition (OCR) during digitisation. Comparing equivalent or similar shared text patterns across the corpora stored in digital archives has become increasingly challenging and time-consuming: at the current scale of digital text data, comparing shared text patterns across multiple documents by hand is practically impossible. Identifying parallel passages, such as those exemplified by paraphrases, also supports a range of natural language tasks, including text generation, information retrieval and extraction, and summarisation. This paper presents an overview of the text mining tools developed to compare parallel passages. The tools were deployed in a system known as Samtla (Search And Mining Tools for Language Archives), which was developed to support research on historic and cultural heritage collections of documents stored in digital archives.
The paper is organised as follows. In Section 2, we review the related work. Section 3 describes the corpora used as test cases to explore the results generated by our proposed approach. Section 4 describes the model used as a basis for extracting and scoring the contents of documents according to their shared text patterns. In Section 5, we describe the approach used for identifying related documents according to our proposed model, where we measure the similarity of pairs of documents based on their character-level n-gram probability distributions. Section 6 presents an approach for visualising local similarities between related documents in the form of variable-length parallel passages extracted from the document content. In Section 7, we briefly discuss the motivation behind the user interface, and in Section 8 we cover some of the language- and corpus-dependent issues that the document comparison tool addresses, to demonstrate the flexibility of the approach across different domains, languages, authors, and time periods. We conclude the paper in Section 9 with a summary of the work and directions for future research and development.
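As a rough illustration of what comparing documents by their character-level n-gram probability distributions might look like, the following minimal Python sketch builds a relative-frequency distribution over character n-grams for each document and scores a pair of documents with one minus the Jensen-Shannon divergence. The choice of divergence, the n-gram length of 3, the lack of smoothing, and all function names are assumptions made for this example; the abstract does not specify the measure or parameters used in Samtla.

```python
from collections import Counter
from math import log2


def char_ngram_distribution(text, n=3):
    """Return the relative frequency of each character n-gram in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (log base 2), which lies in [0, 1]."""
    def kl_to_mixture(a, b):
        # KL(a || m) with m = (a + b) / 2; only n-grams with a[g] > 0 contribute.
        return sum(pa * log2(pa / (0.5 * pa + 0.5 * b.get(g, 0.0)))
                   for g, pa in a.items())
    return 0.5 * kl_to_mixture(p, q) + 0.5 * kl_to_mixture(q, p)


def similarity(doc_a, doc_b, n=3):
    """Score two documents in [0, 1]; higher means more shared n-gram mass.

    Assumption for this sketch: a document shorter than n characters has no
    n-grams and is treated as having similarity 0.0 to everything.
    """
    p = char_ngram_distribution(doc_a, n)
    q = char_ngram_distribution(doc_b, n)
    if not p or not q:
        return 0.0
    return 1.0 - jensen_shannon_divergence(p, q)


if __name__ == "__main__":
    original = "in the beginning God created the heaven and the earth"
    variant = "in the begynnynge God created heaven and erth"
    unrelated = "lorem ipsum dolor sit amet"
    print(similarity(original, variant))
    print(similarity(original, unrelated))
```

Working at the character level rather than the word level means that spelling variants and OCR errors still share most of their n-grams with the original wording, which is one reason such distributions suit historic collections.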