The University of Southampton
University of Southampton Institutional Repository

Framework for automatic information extraction from research papers on nanocrystal devices


Dieb, Thaer M., Yoshioka, Masaharu, Hara, Shinjiro and Newton, Marcus (2015) Framework for automatic information extraction from research papers on nanocrystal devices. Beilstein Journal of Nanotechnology, 6, 1872–1882. (doi:10.3762/bjnano.6.190).

Record type: Article

Abstract

To support nanocrystal device development, we have been working on a computational framework that makes use of the information in research papers on nanocrystal devices. For this purpose, we developed an annotated corpus called “NaDev” (Nanocrystal Device Development) and proposed an automatic information extraction system called “NaDevEx” (Nanocrystal Device Automatic Information Extraction Framework). NaDevEx aims to extract information from research papers on nanocrystal devices using the NaDev corpus and machine-learning techniques. However, the characteristics of NaDevEx had not been examined in detail. In this paper, we conduct system evaluation experiments for NaDevEx using the NaDev corpus. We discuss three main issues: system performance compared with that of human annotators; the effect of paper type (synthesis or characterization) on system performance; and the effects of domain knowledge features (e.g., a chemical named entity recognition system and a list of names of physical quantities) on system performance. We found that the overall system performance was 89% in precision and 69% in recall. Under loose agreement, in which an identified term counts as correct if it intersects with a correct term of the same information category (in many cases, appropriate head nouns such as temperature or pressure match between the two terms), the overall performance is 95% in precision and 74% in recall. The system performance is almost comparable with the results of human annotators for information categories with rich domain knowledge information (source material). For the other information categories, given the relatively large number of terms that appear in only one paper, recall is not high (39–73%), although precision is better (75–97%). The average performance for synthesis papers is better than that for characterization papers because of the lack of training examples for characterization papers. Based on these results, we discuss future research plans for improving the performance of the system.
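
The difference between exact and loose agreement described in the abstract can be illustrated with a minimal sketch. It is not the NaDevEx evaluation code: the `Term` structure, the character offsets, the `Parameter` category name, and the toy data are all assumptions made for illustration. A predicted term counts as correct under exact agreement only if its span and category both equal those of a gold term, and under loose agreement if it intersects a gold term of the same category.

```python
from typing import NamedTuple


class Term(NamedTuple):
    """An annotated term (hypothetical structure, not the NaDev format)."""
    start: int      # character offset, inclusive
    end: int        # character offset, exclusive
    category: str   # information category label


def exact(a: Term, b: Term) -> bool:
    # Exact agreement: same span and same category.
    return a == b


def loose(a: Term, b: Term) -> bool:
    # Loose agreement: same category and intersecting spans.
    return a.category == b.category and a.start < b.end and b.start < a.end


def precision_recall(gold, predicted, match=exact):
    # Precision: share of predicted terms that match some gold term.
    # Recall: share of gold terms that are matched by some predicted term.
    hit_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    hit_gold = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    return precision, recall


# Toy data: the gold term "growth temperature" (0-18) is predicted only as
# "temperature" (7-18), one source material term is found exactly, and one
# gold term is missed entirely.
gold = [
    Term(0, 18, "Parameter"),
    Term(30, 34, "SourceMaterial"),
    Term(40, 48, "Parameter"),
]
predicted = [
    Term(7, 18, "Parameter"),
    Term(30, 34, "SourceMaterial"),
]

strict_scores = precision_recall(gold, predicted, match=exact)
loose_scores = precision_recall(gold, predicted, match=loose)
```

In this toy case the partially matching head noun is rejected under exact agreement and accepted under loose agreement, so both precision and recall rise, which is the same direction as the reported change from 89%/69% to 95%/74%.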

This record has no associated files available for download.

More information

Accepted/In Press date: 20 August 2015
Published date: 7 September 2015

Identifiers

Local EPrints ID: 451829
URI: http://eprints.soton.ac.uk/id/eprint/451829
ISSN: 2190-4286
PURE UUID: c2cc80fa-e25b-43cf-9257-3864aca5849b
ORCID for Marcus Newton: orcid.org/0000-0002-4062-2117

Catalogue record

Date deposited: 29 Oct 2021 16:30
Last modified: 09 Jan 2022 03:45


Contributors

Author: Thaer M. Dieb
Author: Masaharu Yoshioka
Author: Shinjiro Hara
Author: Marcus Newton


