The University of Southampton
University of Southampton Institutional Repository

Data rescue of historical tables through semi-supervised table structure recognition


Loitongbam, Gyanendro and Middleton, Stuart E (2025) Data rescue of historical tables through semi-supervised table structure recognition. International Journal on Document Analysis and Recognition. (In Press)

Record type: Article

Abstract

This study explores Table Structure Recognition (TSR) for digitizing historical documents using a novel semi-supervised learning framework that employs the CascadeTabNet model. TSR is crucial for transforming archival tabular data into digital formats, which improves accessibility and analysis across research fields. Challenges such as physical degradation, inconsistent lighting, and non-standard handwriting hinder the generation of the high-quality annotations of historical documents needed for effective model training. To address these issues, this research poses two questions: (i) Can a semi-supervised training approach reduce the need for expensive data annotations? and (ii) Does semi-supervised training improve model robustness? We applied our methodology to three datasets: GloSAT and ICDAR-2019, which are based on historical documents, and PubTabNet, which consists predominantly of modern documents. Our results indicate that semi-supervised learning substantially increases TSR accuracy and decreases dependency on extensive labelled datasets, providing a robust solution for large-scale digitization initiatives and contributing to the preservation and improved accessibility of historical data. All code from this paper is freely available on GitHub.

Text
IJDAR_2025_accepted_17_11_2025 - Accepted Manuscript
Restricted to Repository staff only until 17 November 2026.
Available under License Creative Commons Attribution.

More information

Accepted/In Press date: 17 November 2025
Keywords: NLP, Document Layout Analysis, Document Analysis, Machine Learning

Identifiers

Local EPrints ID: 507139
URI: http://eprints.soton.ac.uk/id/eprint/507139
ISSN: 1433-2833
PURE UUID: b1a83d4f-fc5e-4823-aac1-623a5f00a048
ORCID for Stuart E Middleton: orcid.org/0000-0001-8305-8176

Catalogue record

Date deposited: 27 Nov 2025 17:53
Last modified: 28 Nov 2025 02:36

Contributors

Author: Gyanendro Loitongbam
Author: Stuart E Middleton
