Using LLMs to infer provenance information
Almuntashiri, Abdullah
aa118cfa-3b60-4717-9855-2816bbbb28d0
Ibáñez, Luis-Daniel
65a2e20b-74a9-427d-8c4c-2330285153ed
Chapman, Age
721b7321-8904-4be2-9b01-876c430743f1
Almuntashiri, Abdullah, Ibáñez, Luis-Daniel and Chapman, Age
(2025)
Using LLMs to infer provenance information.
ACM SIGMOD/PODS International Conference on Management of Data, Berlin, Germany.
22 - 27 Jun 2025.
(In Press)
Record type:
Conference or Workshop Item
(Paper)
Abstract
Having a provenance record facilitates data reuse and experimental reuse. However, provenance capture requires either the use of specific provenance-enabled systems or human documentation. While there have been many examples of provenance-enabled systems for scientific usage, they are still the exception, not the norm. The one standard place for provenance information about scientific experiments remains the scientific publication. Unfortunately, provenance buried in text is not immediately useful for computational purposes. Large Language Models (LLMs) have demonstrated exceptional capability across various tasks, particularly in information extraction. In this paper, we explore the potential of LLMs to infer a provenance record for scientific experiments from scientific papers. We develop an extractor and identify the most effective prompt for provenance extraction. Our results emphasise the capability of ChatGPT-4o in accessing and extracting provenance information from biomedical research papers. Additionally, we assess the scalability of the extractor when extracting provenance information across a set of biomedical research papers.
This record has no associated files available for download.
More information
Accepted/In Press date: 29 April 2025
Venue - Dates:
ACM SIGMOD/PODS International Conference on Management of Data, Berlin, Germany, 2025-06-22 - 2025-06-27
Identifiers
Local EPrints ID: 501728
URI: http://eprints.soton.ac.uk/id/eprint/501728
PURE UUID: 887b468d-6c31-4c2f-9d82-1dcd125cf5ea
Catalogue record
Date deposited: 09 Jun 2025 17:25
Last modified: 10 Jun 2025 02:06
Contributors
Author:
Abdullah Almuntashiri
Author:
Luis-Daniel Ibáñez