The University of Southampton
University of Southampton Institutional Repository

An end-to-end approach for extracting and segmenting high-variance references from PDF documents


Boukhers, Zeyd, Ambhore, Shriharsh and Staab, Steffen (2019) An end-to-end approach for extracting and segmenting high-variance references from PDF documents. ACM/IEEE Joint Conference on Digital Libraries, Urbana-Champaign, Illinois, United States. 02 - 06 Jun 2019. 10 pp. (In Press)

Record type: Conference or Workshop Item (Paper)

Abstract

This paper addresses the problem of extracting and segmenting references from PDF documents. The novelty of the presented approach lies in its ability to discover references that vary widely in content, length and location in the document. Unlike existing works, the proposed method does not follow the classical pipeline of sequential phases. Instead, it learns the different characteristics of references and uses them in a coherent probabilistic scheme that reduces error accumulation. In some publications, such as those in the social sciences, citing sources does not follow the conventions that apply to conventional references, such as placing all references in a single reference section. Therefore, the proposed method aims to extract references with highly varying characteristics by relaxing the restrictions of existing methods. Additionally, we present in this paper a new, challenging dataset of annotated references in German social science publications. The main purpose of this work is to support the indexing of missing references by extracting them from challenging publications such as those of German social science. The effectiveness of the presented methods in terms of both extraction and segmentation is evaluated on different datasets, including the German social science set.

Text: BoukhersJCDL2019 - Accepted Manuscript

More information

Accepted/In Press date: 11 March 2019
Venue - Dates: ACM/IEEE Joint Conference on Digital Libraries, Urbana-Champaign, Illinois, United States, 2019-06-02 - 2019-06-06

Identifiers

Local EPrints ID: 430836
URI: https://eprints.soton.ac.uk/id/eprint/430836
PURE UUID: 300790cd-7302-4674-9f63-5c95c551cdda
ORCID for Steffen Staab: orcid.org/0000-0002-0780-4154

Catalogue record

Date deposited: 15 May 2019 16:30
Last modified: 16 May 2019 00:30

Contributors

Author: Zeyd Boukhers
Author: Shriharsh Ambhore
Author: Steffen Staab
