The University of Southampton
University of Southampton Institutional Repository

An Investigation into the performance of regular expressions within SPARQL query language

University of Southampton
Aljaloud, Saud
11381d3c-d124-4cb5-ac21-2cfb57f5f8a2
Gibbins, Nicholas
98efd447-4aa7-411c-86d1-955a612eceac

Aljaloud, Saud (2019) An Investigation into the performance of regular expressions within SPARQL query language. University of Southampton, Doctoral Thesis, 170pp.

Record type: Thesis (Doctoral)

Abstract

SPARQL is not only the standard query language for the Resource Description Framework (RDF) within the Semantic Web; it has also gradually become one of the main query languages for the graph model in general. To process SPARQL efficiently, an RDF store (as a DBMS) has to be used. However, SPARQL faces significant performance challenges for several reasons: the high flexibility of the RDF model, the fact that SPARQL standardisation does not always focus on performance, and the immaturity of RDF and SPARQL in comparison to other models such as SQL.

One of SPARQL's features is the ability to search through literals/strings using a Regular Expression (Regex) filter. This is a handy and expressive utility that allows users to search through strings or filter certain URIs. However, Regex is computationally expensive and resource intensive; for example, data has to be loaded into memory.
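As a rough illustration of why such a filter is costly, the sketch below mimics a SPARQL `FILTER regex(?label, pattern, flags)` with a naive full scan in Python. The triples, the `regex_filter` helper and the prefixed names are hypothetical and are not taken from the thesis or from any particular RDF store.

```python
import re

# Hypothetical in-memory triples (subject, predicate, literal); illustrative only.
TRIPLES = [
    ("ex:p1", "rdfs:label", "Red Widget"),
    ("ex:p2", "rdfs:label", "Blue Gadget"),
    ("ex:p3", "rdfs:label", "widget holder"),
]

# The SPARQL form this mimics:
#   SELECT ?s WHERE { ?s rdfs:label ?label . FILTER regex(?label, "widget", "i") }

def regex_filter(triples, pattern, flags=""):
    """Naive evaluation: every literal is tested against the pattern (full scan)."""
    compiled = re.compile(pattern, re.IGNORECASE if "i" in flags else 0)
    # search() matches anywhere in the string, like an unanchored SPARQL regex.
    return [s for s, _p, literal in triples if compiled.search(literal)]

print(regex_filter(TRIPLES, "widget", "i"))
```

Every literal is visited regardless of how selective the pattern is, which is the cost an index would try to avoid.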

This thesis investigates the performance of Regex within SPARQL. Firstly, we analyse the way people use Regex within SPARQL by examining a large log of queries made available by various RDF store providers. The analysis identifies various use cases whose performance can be made more efficient. There is very little in the literature that adequately tests the performance of Regex within SPARQL. We therefore also propose the first Regex-specific benchmark for SPARQL, named BSBMstr. BSBMstr shows how various Regex features affect the overall performance of SPARQL queries, and its results are reported for seven known RDF stores.

SPARQL benchmarks in general have been a major field that attracts much research in the area of the Semantic Web. Nevertheless, many have argued that there are still issues in their design or in their simulation of real-world scenarios. This thesis also proposes a generic SPARQL benchmark, named CBSBench, which introduces a new benchmark design. Unlike other benchmarks, CBSBench measures the performance of clusters of queries rather than fixed queries. The use of clusters also provides a stress test for RDF stores, because of the diversity of queries within each cluster. The CBSBench results are also reported on very different RDF stores.

Finally, the thesis introduces (Reg|Ind)ex, a Regex index data structure based on a tri-gram inverted index. This index aims to reduce the result sets that must be scanned to match a Regex filter within SPARQL. The proposal has been implemented on top of two RDF stores and evaluated with two different Regex-specific benchmarks. (Reg|Ind)ex produces a smaller index than previous work, while still producing results faster than the original implementations by up to an order of magnitude.
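The general idea of a tri-gram inverted index can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names are hypothetical, and the caller supplies a substring that every match must contain, whereas a real index would derive such substrings from the Regex itself.

```python
import re
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(literals):
    """Map each tri-gram to the ids of the literals that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(literals):
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index

def search(literals, index, required_literal, pattern):
    """Narrow candidates via the index, then run the regex only on them.

    required_literal is an assumption of this sketch: a substring that
    every match of pattern must contain, supplied by the caller.
    """
    candidates = set(range(len(literals)))
    for gram in trigrams(required_literal):
        candidates &= index.get(gram, set())
    compiled = re.compile(pattern)
    return sorted(i for i in candidates if compiled.search(literals[i]))

LITERALS = ["semantic web", "web server", "graph database", "webbing"]
INDEX = build_index(LITERALS)
print(search(LITERALS, INDEX, "web", r"web\b"))
```

The index only prunes the candidate set; the regex is still run on each remaining literal, so a literal that contains the tri-grams but does not satisfy the pattern is discarded at that final step.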

In general, the thesis provides guidelines that developers can follow to investigate similar features within a given DBMS. The investigation relies mainly on real-world usage, analysing how people use these features. From that analysis, developers can construct queries and features alongside our proposed benchmarks to run tests on their chosen subject. The thesis also discusses various ideas and techniques that can be used to enhance the performance of DBMSs.

Text
Final Thesis - Version of Record
Available under License University of Southampton Thesis Licence.

More information

Published date: January 2019

Identifiers

Local EPrints ID: 428044
URI: http://eprints.soton.ac.uk/id/eprint/428044
PURE UUID: 0cea0091-9c57-4453-a6c7-8dc56e181876
ORCID for Saud Aljaloud: orcid.org/0000-0002-7468-2257
ORCID for Nicholas Gibbins: orcid.org/0000-0002-6140-9956

Catalogue record

Date deposited: 07 Feb 2019 17:30
Last modified: 16 Mar 2024 07:33

Contributors

Author: Saud Aljaloud
Thesis advisor: Nicholas Gibbins
