Data rescue of historical tables through semi-supervised table structure recognition

This study uses a novel semi-supervised learning framework to explore Tabular Structure Recognition (TSR) for digitizing historical documents, specifically employing the CascadeTabNet model. TSR is crucial for transforming archival tabular data into digital formats, enhancing accessibility and analysis across various research fields. Challenges such as physical degradation, inconsistent lighting, and non-standard handwriting hinder the generation of the high-quality annotations of historical documents needed for effective model training. To address these issues, this research investigates two questions: (i) Can a semi-supervised training approach reduce the need for expensive data annotations? and (ii) Does semi-supervised training improve model robustness? We applied our methodology across three datasets: the GloSAT and ICDAR-2019 datasets, which are based on historical documents, and the PubTabNet dataset, which consists predominantly of modern documents. Our results indicate that semi-supervised learning substantially increases TSR accuracy and decreases dependency on extensive labelled datasets, providing a robust solution for large-scale digitization initiatives and contributing to the preservation and improved accessibility of historical data. All code from this paper is freely available on GitHub.

NLP, Document Layout Analysis, Document Analysis, Machine Learning, Optical Character Recognition, OCR

10.1007/s10032-025-00562-6

1433-2833

Loitongbam, Gyanendro

c1d8ea4f-7a54-4c78-8830-3c3064e26ae6

Middleton, Stuart E

404b62ba-d77e-476b-9775-32645b04473f

1 December 2025

Loitongbam, Gyanendro and Middleton, Stuart E (2025) Data rescue of historical tables through semi-supervised table structure recognition. International Journal on Document Analysis and Recognition. (doi:10.1007/s10032-025-00562-6).

Record type: Article