GloSAT Historical Measurement Table Dataset: Enhanced table structure recognition annotation for downstream historical data rescue
Understanding and extracting tables from documents is a research problem that has been studied for decades. Table structure recognition is the labelling of components within a table whose region has either been detected automatically or provided manually. This paper presents the GloSAT historical measurement table dataset, designed to train table structure recognition models for use in downstream historical data rescue applications. The dataset contains 500 scanned and manually annotated images of pages from meteorological measurement logbooks. We enhance standard full table and individual cell annotations by adding annotations for headings, headers, and table bodies. We also provide annotations for coarse segmentation cells, each consisting of multiple data cells logically grouped by ruling lines of ink or whitespace in the table; these groups often correspond to data cells that are semantically related. Our dataset annotations are provided in VOC2007 and ICDAR-2019 Competition on Table Detection and Recognition (cTDaR-19) XML formats, and our dataset can easily be aggregated with the cTDaR-19 dataset. We report results from running a series of benchmark algorithms on our new dataset, concluding that post-processing is very important for performance, and that page style affects model performance less than table type does.
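As an illustration of how VOC2007-style annotations of this kind might be read, the following minimal sketch groups bounding boxes by class label using only the Python standard library. The embedded XML, the file name, and the class labels `table` and `cell` are assumptions made for this example; they follow the general Pascal VOC layout and are not copied from the dataset, whose actual label names may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Illustrative VOC2007-style annotation. The file name, class labels and
# coordinates are invented for this sketch and are not taken from the dataset.
SAMPLE_VOC_XML = """
<annotation>
  <filename>example_page.jpg</filename>
  <size><width>2000</width><height>3000</height><depth>3</depth></size>
  <object>
    <name>table</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>1900</xmax><ymax>2800</ymax></bndbox>
  </object>
  <object>
    <name>cell</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>400</xmax><ymax>300</ymax></bndbox>
  </object>
  <object>
    <name>cell</name>
    <bndbox><xmin>400</xmin><ymin>200</ymin><xmax>700</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>
"""


def read_voc_boxes(xml_text):
    """Return {label: [(xmin, ymin, xmax, ymax), ...]} for one annotation."""
    root = ET.fromstring(xml_text.strip())
    boxes = defaultdict(list)
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Coordinates are parsed via float() in case a tool wrote decimals.
        box = tuple(
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes[label].append(box)
    return dict(boxes)


boxes = read_voc_boxes(SAMPLE_VOC_XML)
for label, items in boxes.items():
    print(label, len(items))
```

Grouping by label keeps the full-table boxes separate from the cell-level boxes, which is a convenient starting point if the enhanced annotation types (headings, headers, table bodies, coarse segmentation cells) are stored as further class labels; whether they are is an assumption to check against the dataset files.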
Document Layout Analysis, Table Structure Recognition, Image Processing, Deep Learning, Historical documents, Measurements
Middleton, Stuart
404b62ba-d77e-476b-9775-32645b04473f
Ziomek, Juliusz
b05e7f21-70db-497c-be74-b0b54d2a4579
5 September 2021
Middleton, Stuart and Ziomek, Juliusz (2021) GloSAT Historical Measurement Table Dataset: Enhanced table structure recognition annotation for downstream historical data rescue. International Workshop on Historical Document Imaging and Processing, Lausanne, Switzerland. 05 - 06 Sep 2021. 6 pp.
Record type: Conference or Workshop Item (Paper)
Text: GloSAT_HIPS2021_12_05_2021 - Accepted Manuscript
More information
Accepted/In Press date: 1 July 2021
Published date: 5 September 2021
Venue - Dates: International Workshop on Historical Document Imaging and Processing, Lausanne, Switzerland, 2021-09-05 - 2021-09-06
Identifiers
Local EPrints ID: 450279
URI: http://eprints.soton.ac.uk/id/eprint/450279
PURE UUID: c228c605-baad-4d2f-9e44-a587c2b2c154
Catalogue record
Date deposited: 20 Jul 2021 16:31
Last modified: 17 Mar 2024 02:52
Contributors
Author: Stuart Middleton
Author: Juliusz Ziomek