The University of Southampton
University of Southampton Institutional Repository

Improving searchability of datasets

Kacprzak, Emilia, Magdalena
fdc38ad7-6879-4769-ad65-5d3582690af2
Ibanez Gonzalez, Luis
65a2e20b-74a9-427d-8c4c-2330285153ed

Kacprzak, Emilia, Magdalena (2022) Improving searchability of datasets. University of Southampton, Doctoral Thesis, 146pp.

Record type: Thesis (Doctoral)

Abstract

Data is one of the most important digital assets in the world thanks to its business and social value. As it becomes increasingly available online, using it effectively requires tools that retrieve the datasets most relevant to our information needs. Web search engines are not well suited to this task, as they are designed for documents, not data. In recent years several bespoke search engines have been proposed to help with finding datasets, such as Google Dataset Search, which crawls the whole web, and DataMed, which focuses on indexing biomedical datasets. In this work we look more closely at the problem of searching for data, using Open Data platforms as an example. We first applied a mixed-methods approach aimed at deepening our understanding of the users of Open Data portals and the types of queries they issue when searching for datasets, accompanied by an analysis of search sessions on one of these portals. Based on our findings, we examine a particular problem of dataset interpretation: the meaning of numerical columns. We propose a novel approach for assigning semantic labels to numerical columns. We conclude with an analysis of the future work needed in the field that could improve the searchability of datasets on the web.

Text
Emilia_Kacprzak_PhD_WAIS_27_March - Version of Record
Available under License University of Southampton Thesis Licence.
Download (5MB)

More information

Submitted date: March 2022

Identifiers

Local EPrints ID: 457260
URI: http://eprints.soton.ac.uk/id/eprint/457260
PURE UUID: 2bf7dc65-d906-4790-8a4b-ede58674f877
ORCID for Luis Ibanez Gonzalez: orcid.org/0000-0001-6993-0001

Catalogue record

Date deposited: 30 May 2022 16:34
Last modified: 17 Mar 2024 03:39

Contributors

Author: Emilia Magdalena Kacprzak
Thesis advisor: Luis Ibanez Gonzalez
