Tabular context-aware optical character recognition and tabular data reconstruction for historical records
Digitizing historical tabular records is essential for preserving and analyzing valuable data across many fields, but it is made difficult by complex layouts, mixed text types, and degraded document quality. This paper introduces a framework that addresses these issues through three contributions. First, it presents UoS Data Rescue, a novel dataset of 1,113 historical logbooks with over 594,000 annotated text cells, designed to cover handwritten entries, aging artifacts, and intricate layouts. Second, it proposes a context-aware text extraction approach (TrOCR-ctx) that reduces cascading errors during table digitization. Third, it proposes an enhanced end-to-end OCR pipeline that integrates TrOCR-ctx with ByT5 for real-time post-OCR correction, providing improved multilingual support. The pipeline reduces errors in table digitization tasks by correcting OCR outputs in real time during training. The model achieves a word error rate of 0.049 and a character error rate of 0.035, outperforming existing methods by up to 41% in OCR tasks and 10.74% in table reconstruction tasks. The framework offers a robust solution for large-scale digitization of tabular documents, with applications beyond climate records in other domains that require structured document preservation. The dataset and implementation are available as open-source resources.
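For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how word error rate (WER) and character error rate (CER) are conventionally computed: the Levenshtein edit distance between the reference and the OCR output, divided by the length of the reference. This is the standard textbook definition, not the authors' evaluation code; the function names and the sample cell strings are illustrative assumptions, and the paper's own tokenization or normalization may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions, and substitutions each cost 1.
    Works on any sequences (lists of words or lists of characters).
    """
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character error rate: units are individual characters."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word error rate: units are whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


if __name__ == "__main__":
    # Hypothetical table-cell contents, chosen only for illustration.
    print(wer("wind NE 4", "wind NF 4"))
    print(cer("1013.2", "1018.2"))
```

Under this definition, a WER of 0.049 corresponds to roughly five word-level edits per hundred reference words.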
Keywords: AI, Data Rescue, NLP, OCR, Optical Character Recognition, Historical Document Analysis, Semi-Supervised Learning, Data Annotation, Tabular Structure Recognition
Loitongbam, Gyanendro
Middleton, Stuart E
1 July 2025
Loitongbam, Gyanendro and Middleton, Stuart E (2025) Tabular context-aware optical character recognition and tabular data reconstruction for historical records. International Journal on Document Analysis and Recognition. (doi:10.1007/s10032-025-00543-9).
More information
Accepted/In Press date: 10 June 2025
Published date: 1 July 2025
Identifiers
Local EPrints ID: 502849
URI: http://eprints.soton.ac.uk/id/eprint/502849
ISSN: 1433-2833
PURE UUID: 91b70760-fe8b-484b-86ab-036d623905b3
Catalogue record
Date deposited: 09 Jul 2025 16:39
Last modified: 11 Sep 2025 01:55
Contributors
Author:
Gyanendro Loitongbam