Using LLMs to infer provenance information

Almuntashiri, Abdullah, Ibáñez, Luis-Daniel and Chapman, Age (2025) Using LLMs to infer provenance information. ACM SIGMOD/PODS International Conference on Management of Data, Berlin, Germany. 22 - 27 Jun 2025. (In Press)

Record type: Conference or Workshop Item (Paper)

Abstract

Having a provenance record facilitates data reuse and experimental reuse. However, provenance capture requires either the use of specific provenance-enabled systems or human documentation. While there have been many examples of provenance-enabled systems for scientific usage, they are still the exception, not the norm. The one standard place for provenance information about scientific experiments remains the scientific publication. Unfortunately, provenance buried in text is not immediately useful for computational purposes. Large Language Models (LLMs) have demonstrated exceptional capability across various tasks, particularly in information extraction. In this paper, we explore the potential of LLMs to infer a provenance record for scientific experiments from scientific papers. We develop an extractor and identify the most effective prompt for provenance extraction. Our results emphasise the capability of ChatGPT-4o in accessing and extracting provenance information from biomedical research papers. Additionally, we assess the scalability of the extractor when extracting provenance information across a set of biomedical research papers.

This record has no associated files available for download.