Data rescue of historical tables through semi-supervised table structure recognition
Loitongbam, Gyanendro
c1d8ea4f-7a54-4c78-8830-3c3064e26ae6
Middleton, Stuart E
404b62ba-d77e-476b-9775-32645b04473f
Loitongbam, Gyanendro and Middleton, Stuart E (2025) Data rescue of historical tables through semi-supervised table structure recognition. International Journal on Document Analysis and Recognition. (In Press)
Abstract
This study uses a novel semi-supervised learning framework to explore Tabular Structure Recognition (TSR) for digitizing historical documents, specifically employing the CascadeTabNet model. TSR is crucial for transforming archival tabular data into digital formats, enhancing accessibility and analysis across various research fields. Challenges such as physical degradation, inconsistent lighting, and non-standard handwriting hinder the generation of the high-quality annotations of historical documents needed for effective model training. To address these issues, this research poses two questions: (i) Can a semi-supervised training approach reduce the need for expensive data annotations? and (ii) Does semi-supervised training improve model robustness? We applied our methodology across three datasets: the GloSAT and ICDAR-2019 datasets, which are based on historical documents, and the PubTabNet dataset, which consists predominantly of modern documents. Our results indicate that semi-supervised learning substantially increases TSR accuracy and decreases dependency on extensive labelled datasets, providing a robust solution for large-scale digitization initiatives and contributing to the preservation and improved accessibility of historical data. All code from this paper is freely available on GitHub.
Text: IJDAR_2025_accepted_17_11_2025 (Accepted Manuscript)
Restricted to Repository staff only until 17 November 2026.
More information
Accepted/In Press date: 17 November 2025
Keywords:
NLP, Document Layout Analysis, Document Analysis, Machine Learning
Identifiers
Local EPrints ID: 507139
URI: http://eprints.soton.ac.uk/id/eprint/507139
ISSN: 1433-2833
PURE UUID: b1a83d4f-fc5e-4823-aac1-623a5f00a048
Catalogue record
Date deposited: 27 Nov 2025 17:53
Last modified: 28 Nov 2025 02:36
Contributors
Author:
Gyanendro Loitongbam