Tabular context-aware optical character recognition and tabular data reconstruction for historical records

Digitizing historical tabular records is essential for preserving and analyzing valuable data across various fields, but it presents challenges due to complex layouts, mixed text types, and degraded document quality. This paper introduces a comprehensive framework to address these issues through three key contributions. First, it presents UoS Data Rescue, a novel dataset of 1,113 historical logbooks with over 594,000 annotated text cells, designed to handle the complexities of handwritten entries, aging artifacts, and intricate layouts. Second, it proposes a novel context-aware text extraction approach (TrOCR-ctx) to reduce cascading errors during table digitization. Third, it proposes an enhanced end-to-end OCR pipeline that integrates TrOCR-ctx with ByT5 for real-time post-OCR correction, providing improved multilingual support. This pipeline reduces errors encountered in table digitization tasks by correcting OCR outputs in real time during training. The model achieves superior performance with a 0.049 word error rate and 0.035 character error rate, outperforming existing methods by up to 41% in OCR tasks and 10.74% in table reconstruction tasks. This framework offers a robust solution for large-scale digitization of tabular documents, extending its applications beyond climate records to other domains requiring structured document preservation. The dataset and implementation are available as open-source resources.

Keywords: AI, Data Rescue, NLP, OCR, Optical Character Recognition, Historical Document Analysis, Semi-Supervised Learning, Data Annotation, Tabular Structure Recognition

DOI: 10.1007/s10032-025-00543-9

ISSN: 1433-2833

Loitongbam, Gyanendro

Middleton, Stuart E

Loitongbam, Gyanendro and Middleton, Stuart E (2025) Tabular context-aware optical character recognition and tabular data reconstruction for historical records. International Journal on Document Analysis and Recognition. (doi:10.1007/s10032-025-00543-9).

Record type: Article

Text: Accepted Manuscript (5MB) and Version of Record (1MB), both available under License Creative Commons Attribution.

More information

Accepted/In Press date: 10 June 2025

Published date: 1 July 2025
