Finding parallel passages in cultural heritage archives

Researchers and scholars in many disciplines, particularly those working on cultural heritage projects, have a strong interest in studying parallel passages (i.e., identical or similar pieces of text describing the same thing) in digital text archives. Although a few software tools exist for this purpose, they are restricted to a specific domain (e.g., the Bible) or a specific language (e.g., Hebrew). In this article, we present in detail how we build a digital infrastructure that facilitates the search and discovery of parallel passages for any domain in any language. It is at the core of our Samtla (Search And Mining Tools with Linguistic Analysis) system, designed in collaboration with historians and linguists. The system has already been used to support research on five large text corpora that span a number of different domains and languages. The key to such a domain-independent and language-independent digital infrastructure is a novel combination of a character-based n-gram language model, a space-optimized suffix tree, and generalized edit distance. A comprehensive evaluation through crowdsourcing shows that the effectiveness of our system's search functionality is on par with human-level performance.
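
As a rough illustration of two of the building blocks named above, the following sketch shows character n-gram extraction and an edit distance with a pluggable substitution cost. It is not the Samtla implementation: the function names `char_ngrams` and `weighted_edit_distance`, the default costs, and the vowel-based cost function in the usage note are all assumptions made for this example, and the suffix tree and language model are omitted.

```python
def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def weighted_edit_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Edit distance where the substitution cost is supplied by the caller.

    With the default costs this reduces to the standard Levenshtein distance.
    The cost scheme is an assumption for illustration only.
    """
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0

    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,            # delete ca
                curr[j - 1] + indel_cost,        # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # substitute ca with cb
            ))
        prev = curr
    return prev[-1]
```

For instance, a caller could make vowel-for-vowel substitutions cheaper, so that spelling variants such as "cat" and "cot" score as closer than unrelated strings; character-based n-grams keep the approach independent of any language-specific tokenizer.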

Digital archives, Information retrieval, Statistical language models, Suffix trees

10.1145/3195727

1556-4673

Harris, Martyn

3f289d34-8220-4877-a3c2-22af3e08c5b3

Levene, Mark

4ad83ded-d4b9-40eb-a795-b2382a9a296a

Zhang, Dell

ae078ed1-bc72-431f-a6c9-eaaf9c73e946

Levene, Dan

fdf6fd40-020a-4cbb-b953-d5c2dcc6a002

September 2018