University of Southampton Institutional Repository

Finding parallel passages in cultural heritage archives

Harris, Martyn, Levene, Mark, Zhang, Dell and Levene, Dan (2018) Finding parallel passages in cultural heritage archives. Journal on Computing and Cultural Heritage, 11 (3). (doi:10.1145/3195727).

Record type: Article

Abstract

Researchers and scholars in many disciplines, particularly those working on cultural heritage projects, have a strong interest in studying parallel passages (i.e., identical or similar pieces of text describing the same thing) in digital text archives. Although a few software tools exist for this purpose, they are restricted to a specific domain (e.g., the Bible) or a specific language (e.g., Hebrew). In this article, we present in detail how we built a digital infrastructure that facilitates the search and discovery of parallel passages for any domain in any language. It is at the core of our Samtla (Search And Mining Tools with Linguistic Analysis) system, designed in collaboration with historians and linguists. The system has already been used to support research on five large text corpora that span a number of different domains and languages. The key to such a domain-independent and language-independent digital infrastructure is a novel combination of a character-based n-gram language model, a space-optimized suffix tree, and generalized edit distance. A comprehensive evaluation through crowdsourcing shows that the effectiveness of our system's search functionality is on par with human-level performance.
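
As a rough illustration of two of the building blocks named in the abstract, the sketch below extracts character-based n-grams and computes an edit distance with per-pair substitution costs. It is not the Samtla implementation: the function names `char_ngrams` and `weighted_edit_distance`, the `sub_cost` table, and the default costs are all assumptions made for this example, and "generalized edit distance" is read here simply as edit distance with non-uniform costs.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def weighted_edit_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Edit distance in which substitution cost may depend on the character pair.

    sub_cost is a hypothetical dict mapping (x, y) to a cost; unlisted
    pairs of differing characters cost 1.0, identical characters cost 0.
    """
    sub_cost = sub_cost or {}
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            s = 0.0 if x == y else sub_cost.get((x, y), 1.0)
            curr.append(min(prev[j] + indel_cost,      # delete x
                            curr[j - 1] + indel_cost,  # insert y
                            prev[j - 1] + s))          # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(char_ngrams("parallel", 3))
    # A cheap u -> v substitution makes two spelling variants close.
    print(weighted_edit_distance("iuvenis", "ivvenis", {("u", "v"): 0.2}))
```

Working at the character level is what lets a scheme like this avoid language-specific tokenization, and lowering the cost of selected substitutions is one way spelling variants could be treated as near matches.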

Full text not available from this repository.

More information

Accepted/In Press date: 1 February 2018
e-pub ahead of print date: 5 September 2018
Published date: September 2018
Keywords: Digital archives, Information retrieval, Statistical language models, Suffix trees

Identifiers

Local EPrints ID: 426911
URI: https://eprints.soton.ac.uk/id/eprint/426911
ISSN: 1556-4673
PURE UUID: 7ac89fbf-eaf7-4235-aae5-8bb801512cea

Catalogue record

Date deposited: 14 Dec 2018 17:30
Last modified: 12 Jul 2019 17:01

Contributors

Author: Martyn Harris
Author: Mark Levene
Author: Dell Zhang
Author: Dan Levene
