The University of Southampton
University of Southampton Institutional Repository

GloSAT Historical Measurement Table Dataset: Enhanced table structure recognition annotation for downstream historical data rescue


Middleton, Stuart and Ziomek, Juliusz (2021) GloSAT Historical Measurement Table Dataset: Enhanced table structure recognition annotation for downstream historical data rescue. International Workshop on Historical Document Imaging and Processing, Lausanne, Switzerland. 05 - 06 Sep 2021. 6 pp. (In Press)

Record type: Conference or Workshop Item (Paper)

Abstract

Understanding and extracting tables from documents is a research problem that has been studied for decades. Table structure recognition is the labelling of components within a detected table, where the table region is either detected automatically or provided manually. This paper presents the GloSAT historical measurement table dataset, designed to train table structure recognition models for use in downstream historical data rescue applications. The dataset contains 500 scanned and manually annotated images of pages from meteorological measurement logbooks. We enhance the standard full-table and individual-cell annotations with additional annotations for headings, headers, and table bodies. We also provide annotations for coarse segmentation cells, each consisting of multiple data cells logically grouped by ruling lines of ink or whitespace in the table; these groupings often correspond to data cells that are semantically grouped. Our dataset annotations are provided in VOC2007 and ICDAR-2019 Competition on Table Detection and Recognition (cTDaR-19) XML formats, and our dataset can easily be aggregated with the cTDaR-19 dataset. We report results from running a series of benchmark algorithms on our new dataset, concluding that post-processing is very important for performance and that page style has less effect on model performance than table type.
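
As a rough illustration of what working with VOC2007-style XML annotations can look like, the sketch below reads labelled bounding boxes from an annotation string. The element names (annotation, object, name, bndbox, xmin, ymin, xmax, ymax) follow the general PASCAL VOC convention. The file name, the class labels "table" and "cell", and the helper read_voc_boxes are assumptions made for this example and are not taken from the GloSAT release, whose actual labels and layout may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical VOC2007-style annotation. The file name and the class labels
# ("table", "cell") are illustrative assumptions, not GloSAT's actual values.
SAMPLE_VOC_XML = """<annotation>
  <filename>page_001.jpg</filename>
  <size><width>2000</width><height>3000</height><depth>3</depth></size>
  <object>
    <name>table</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>1900</xmax><ymax>2800</ymax></bndbox>
  </object>
  <object>
    <name>cell</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>400</xmax><ymax>260</ymax></bndbox>
  </object>
</annotation>"""


def read_voc_boxes(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Coordinates are parsed via float first so values such as "100.0"
        # are also accepted, then truncated to integer pixel positions.
        coords = tuple(
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, coords))
    return boxes


boxes = read_voc_boxes(SAMPLE_VOC_XML)
for label, coords in boxes:
    print(label, coords)
```

Keeping the label alongside each box makes it straightforward to filter by annotation type before training or evaluation, for example selecting only full-table regions.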


More information

Accepted/In Press date: 1 July 2021
Venue - Dates: International Workshop on Historical Document Imaging and Processing, Lausanne, Switzerland, 2021-09-05 - 2021-09-06
Keywords: Document Layout Analysis, Table Structure Recognition, Image Processing, Deep Learning, Historical documents, Measurements

Identifiers

Local EPrints ID: 450279
URI: http://eprints.soton.ac.uk/id/eprint/450279
PURE UUID: c228c605-baad-4d2f-9e44-a587c2b2c154
ORCID for Stuart Middleton: orcid.org/0000-0001-8305-8176

Catalogue record

Date deposited: 20 Jul 2021 16:31
Last modified: 21 Jul 2021 01:38

Contributors

Author: Stuart Middleton
Author: Juliusz Ziomek
