The University of Southampton
University of Southampton Institutional Repository

Tabular context-aware optical character recognition and tabular data reconstruction for historical records

Loitongbam, Gyanendro and Middleton, Stuart E (2025) Tabular context-aware optical character recognition and tabular data reconstruction for historical records. International Journal on Document Analysis and Recognition. (doi:10.1007/s10032-025-00543-9).

Record type: Article

Abstract

Digitizing historical tabular records is essential for preserving and analyzing valuable data across various fields, but it presents challenges due to complex layouts, mixed text types, and degraded document quality. This paper introduces a comprehensive framework to address these issues through three key contributions. First, it presents UoS Data Rescue, a novel dataset of 1,113 historical logbooks with over 594,000 annotated text cells, designed to handle the complexities of handwritten entries, aging artifacts, and intricate layouts. Second, it proposes a novel context-aware text extraction approach (TrOCR-ctx) to reduce cascading errors during table digitization. Third, it proposes an enhanced end-to-end OCR pipeline that integrates TrOCR-ctx with ByT5 for real-time post-OCR correction, providing improved multilingual support. This pipeline reduces errors encountered in table digitization tasks by correcting OCR outputs in real time during training. The model achieves superior performance with a 0.049 word error rate and 0.035 character error rate, outperforming existing methods by up to 41% in OCR tasks and 10.74% in table reconstruction tasks. This framework offers a robust solution for large-scale digitization of tabular documents, extending its applications beyond climate records to other domains requiring structured document preservation. The dataset and implementation are available as open-source resources.
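
The word error rate and character error rate quoted in the abstract are conventionally defined as the edit distance between the OCR output and the reference transcription, divided by the reference length in words or characters. The following is a minimal sketch of that standard definition only. This record does not state the paper's exact evaluation protocol (text normalisation, how scores are aggregated across cells), so the function names and sample strings here are illustrative assumptions, not the authors' implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions and substitutions each cost 1.
    Works on strings (characters) and on lists (words).
    """
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference words (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical logbook cell: the OCR output drops one letter.
reference = "wind NNE force 4"
hypothesis = "wind NE force 4"
print(f"WER = {wer(reference, hypothesis):.4f}")
print(f"CER = {cer(reference, hypothesis):.4f}")
```

Under this definition a single wrong character in a short table cell costs a full word error, which is one reason WER is typically higher than CER on the same output.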

Text: OCR_2024__Final_ICDAR_2025_ - Accepted Manuscript (5MB)
Available under License Creative Commons Attribution.
Text: s10032-025-00543-9 (1) - Version of Record (1MB)
Available under License Creative Commons Attribution.

More information

Accepted/In Press date: 10 June 2025
Published date: 1 July 2025
Keywords: AI, Data Rescue, NLP, OCR, Optical Character Recognition, Historical Document Analysis, Semi-Supervised Learning, Data Annotation, Tabular Structure Recognition

Identifiers

Local EPrints ID: 502849
URI: http://eprints.soton.ac.uk/id/eprint/502849
ISSN: 1433-2833
PURE UUID: 91b70760-fe8b-484b-86ab-036d623905b3
ORCID for Stuart E Middleton: orcid.org/0000-0001-8305-8176

Catalogue record

Date deposited: 09 Jul 2025 16:39
Last modified: 11 Sep 2025 01:55

Contributors

Author: Gyanendro Loitongbam
