The University of Southampton
University of Southampton Institutional Repository

A sequence modeling approach for structured data extraction from unstructured text

41-50
Association for Computational Linguistics (ACL)
Deshmukh, Jayati
5903b0c1-b4d1-4fbf-b687-610d4fde3990
Annervaz, K. M.
60ecdbb0-0673-49ca-92d4-29e48a46a0bb
Sengupta, Shubhashis
b7c8401f-33ff-4edc-89cf-228aa902a6cc

Deshmukh, Jayati, Annervaz, K. M. and Sengupta, Shubhashis (2019) A sequence modeling approach for structured data extraction from unstructured text. In IJCAI 2019 - Proceedings of the 5th Workshop on Semantic Deep Learning, SemDeep 2019. Association for Computational Linguistics (ACL). pp. 41-50.

Record type: Conference or Workshop Item (Paper)

Abstract

Extraction of structured information from unstructured text has always been a problem of interest for the NLP community. Structured data is concise to store, search and retrieve, and it is easier for both humans and machines to consume. Traditionally, structured data extraction from text has been done by using various parsing methodologies and applying domain-specific rules and heuristics. In this work, we leverage developments in sequence modeling for the problem of structured data extraction. Initially, we posed the problem as a machine translation problem and used a state-of-the-art machine translation model. Based on these initial results, we changed the approach to a sequence tagging one. We propose an extension of one of the attractive models for sequence tagging, tailored to our problem and effective on it. This gave a 4.4% improvement over the vanilla sequence tagging model. We also propose another variant of the sequence tagging model which can handle multiple labels per word. Experiments were performed on the Wikipedia Infobox dataset of biographies, and results are presented for both single-label and multi-label models. These models indicate that deep-learning-based methods are an effective alternative for extracting structured data from raw text.
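
To illustrate how a sequence tagging formulation yields structured data, the following is a minimal sketch and not the authors' model. It covers only the final step of reading a record off tokens that have already been tagged; the tagger itself is omitted. The function name `tags_to_record`, the example sentence, and the field labels (`name`, `birth_place`, `birth_date`, `residence`) are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def tags_to_record(tokens, tags, outside="O"):
    """Group tokens by field label to form a structured record.

    tags[i] is a list of labels for tokens[i]. A single-label tagger is
    the special case of exactly one label per token. Tokens carrying
    only the `outside` label belong to no field and are skipped.
    """
    fields = defaultdict(list)
    for token, labels in zip(tokens, tags):
        for label in labels:
            if label != outside:
                fields[label].append(token)
    # Join the tokens collected for each field into one string value.
    return {label: " ".join(words) for label, words in fields.items()}


# Hypothetical sentence and labels, for illustration only.
tokens = ["Ada", "Lovelace", "was", "born", "in", "London", "in", "1815"]

# Single-label case: one label per token.
single = [["name"], ["name"], ["O"], ["O"], ["O"],
          ["birth_place"], ["O"], ["birth_date"]]
record = tags_to_record(tokens, single)

# Multi-label case: "London" fills two fields at once.
multi = [["name"], ["name"], ["O"], ["O"], ["O"],
         ["birth_place", "residence"], ["O"], ["birth_date"]]
multi_record = tags_to_record(tokens, multi)
```

In the single-label case each token contributes to at most one field, while in the multi-label case the same token can appear in several fields of the resulting record.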

This record has no associated files available for download.

More information

Published date: 1 January 2019
Additional Information: Publisher Copyright: © IJCAI 2019 - Proceedings of the 5th Workshop on Semantic Deep Learning, SemDeep 2019. All rights reserved.
Venue - Dates: 5th Workshop on Semantic Deep Learning, SemDeep 2019, held in conjunction with IJCAI 2019, Macau, China, 2019-08-12

Identifiers

Local EPrints ID: 493208
URI: http://eprints.soton.ac.uk/id/eprint/493208
PURE UUID: a661303f-dfb6-4fb3-a93d-4c07b93074f9
ORCID for Jayati Deshmukh: orcid.org/0000-0002-1144-2635

Catalogue record

Date deposited: 27 Aug 2024 17:31
Last modified: 28 Aug 2024 02:16

Contributors

Author: Jayati Deshmukh
Author: K. M. Annervaz
Author: Shubhashis Sengupta
