The University of Southampton
University of Southampton Institutional Repository

Using LLMs to infer provenance information

Almuntashiri, Abdullah, Ibáñez, Luis-Daniel and Chapman, Age (2025) Using LLMs to infer provenance information. ACM SIGMOD/PODS International Conference on Management of Data, Berlin, Germany. 22 - 27 Jun 2025. (In Press)

Record type: Conference or Workshop Item (Paper)

Abstract

Having a provenance record facilitates data reuse and experimental reuse. However, provenance capture requires either the use of specific provenance-enabled systems or human documentation. While there have been many examples of provenance-enabled systems for scientific usage, they are still the exception, not the norm. The one standard place for provenance information about scientific experiments remains the scientific publication. Unfortunately, provenance buried in text is not immediately useful for computational purposes. Large Language Models (LLMs) have demonstrated exceptional capability across various tasks, particularly in information extraction. In this paper, we explore the potential of LLMs to infer a provenance record for scientific experiments from scientific papers. We develop an extractor and identify the most effective prompt for provenance extraction. Our results emphasise the capability of ChatGPT-4o in accessing and extracting provenance information from biomedical research papers. Additionally, we assess the scalability of the extractor when it is applied to a set of biomedical research papers.
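
As an illustration of what "inferring a provenance record from text" could involve, the following is a minimal sketch only. It is not the authors' extractor: the prompt wording, the `ProvenanceRecord` class, the function names and the example response are all hypothetical, and the relation names are a small subset borrowed from the W3C PROV data model. No LLM is called; the sketch only shows how a prompt might be built and how a JSON answer might be checked before use.

```python
import json
from dataclasses import dataclass, field

# Hypothetical prompt template; the paper's actual prompts are not given in this record.
PROMPT_TEMPLATE = (
    "Extract the provenance of the experiment described below.\n"
    "Return JSON with keys 'entities', 'activities', 'agents' (lists of strings) "
    "and 'relations' (list of [subject, relation, object]).\n\n"
    "Text:\n{text}"
)

# Assumption: a small subset of W3C PROV-DM relation names.
ALLOWED_RELATIONS = {"used", "wasGeneratedBy", "wasAssociatedWith"}


@dataclass
class ProvenanceRecord:
    entities: list = field(default_factory=list)
    activities: list = field(default_factory=list)
    agents: list = field(default_factory=list)
    relations: list = field(default_factory=list)


def build_prompt(text: str) -> str:
    """Insert the paper text into the (hypothetical) extraction prompt."""
    return PROMPT_TEMPLATE.format(text=text.strip())


def parse_response(raw: str) -> ProvenanceRecord:
    """Parse a JSON answer and keep only relations that are well formed."""
    data = json.loads(raw)
    record = ProvenanceRecord(
        entities=list(data.get("entities", [])),
        activities=list(data.get("activities", [])),
        agents=list(data.get("agents", [])),
    )
    known = set(record.entities) | set(record.activities) | set(record.agents)
    for subject, relation, obj in data.get("relations", []):
        # Drop relations with unknown names or endpoints that were never declared.
        if relation in ALLOWED_RELATIONS and subject in known and obj in known:
            record.relations.append((subject, relation, obj))
    return record


# Made-up model output, for illustration only (not a result from the paper).
example_response = json.dumps({
    "entities": ["blood samples", "expression matrix"],
    "activities": ["RNA sequencing"],
    "agents": ["sequencing lab"],
    "relations": [
        ["RNA sequencing", "used", "blood samples"],
        ["expression matrix", "wasGeneratedBy", "RNA sequencing"],
        ["RNA sequencing", "wasAssociatedWith", "sequencing lab"],
        ["expression matrix", "wasDerivedFrom", "unknown dataset"],
    ],
})
record = parse_response(example_response)
```

In this sketch the fourth relation is discarded because its name is outside the allowed subset and its object was never declared, which is one simple way to keep ungrounded model output out of a machine-readable record.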

This record has no associated files available for download.

More information

Accepted/In Press date: 29 April 2025
Venue - Dates: ACM SIGMOD/PODS International Conference on Management of Data, Berlin, Germany, 2025-06-22 - 2025-06-27

Identifiers

Local EPrints ID: 501728
URI: http://eprints.soton.ac.uk/id/eprint/501728
PURE UUID: 887b468d-6c31-4c2f-9d82-1dcd125cf5ea
ORCID for Abdullah Almuntashiri: orcid.org/0000-0002-7343-6468
ORCID for Luis-Daniel Ibáñez: orcid.org/0000-0001-6993-0001
ORCID for Age Chapman: orcid.org/0000-0002-3814-2587

Catalogue record

Date deposited: 09 Jun 2025 17:25
Last modified: 10 Jun 2025 02:06

Contributors

Author: Abdullah Almuntashiri
Author: Luis-Daniel Ibáñez
Author: Age Chapman
