An Investigation into the performance of regular expressions within SPARQL query language

Aljaloud, Saud (2019) An Investigation into the performance of regular expressions within SPARQL query language. University of Southampton, Doctoral Thesis, 170pp.

Record type: Thesis (Doctoral)

Abstract

SPARQL is not only the standard query language for the Resource Description Framework (RDF) within the Semantic Web; it has also gradually become one of the main query languages for the graph model in general. To process SPARQL efficiently, an RDF store (acting as a DBMS) has to be used. However, SPARQL faces significant performance challenges for various reasons: the high flexibility of the RDF model, the fact that SPARQL standardisation does not always focus on performance, and the immaturity of RDF and SPARQL in comparison to other models such as SQL.

One of SPARQL's features is the ability to search through literals (strings) using a Regular Expression (Regex) filter. This is a handy and expressive utility that allows users to search through strings or filter certain URIs. However, Regex is computationally expensive and resource intensive; for example, the data has to be loaded into memory.
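To illustrate why a Regex filter is costly, the following minimal Python sketch imitates what a store without a dedicated index has to do: test the pattern against every candidate literal. The toy triples, the prefixes and the `regex_filter` helper are hypothetical and not taken from the thesis, and Python's `re` module only approximates the XPath-style regular expressions that SPARQL specifies.

```python
import re

# Hypothetical toy data: (subject, predicate, literal) triples.
TRIPLES = [
    ("ex:p1", "rdfs:label", "Laptop Pro 15"),
    ("ex:p2", "rdfs:label", "Desktop Tower"),
    ("ex:p3", "rdfs:label", "laptop sleeve"),
]


def regex_filter(triples, predicate, pattern, flags=""):
    """Naive evaluation of a SPARQL-style regex filter.

    Roughly corresponds to:
      SELECT ?s WHERE { ?s rdfs:label ?label .
                        FILTER regex(?label, "^laptop", "i") }
    Every literal of the predicate is scanned, which is the expensive part.
    """
    re_flags = re.IGNORECASE if "i" in flags else 0
    compiled = re.compile(pattern, re_flags)
    return [s for s, p, o in triples if p == predicate and compiled.search(o)]


matches = regex_filter(TRIPLES, "rdfs:label", "^laptop", "i")
print(matches)
```

The cost of this approach grows with the number of literals, regardless of how selective the pattern is.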

This thesis aims to investigate the performance of Regex within SPARQL. Firstly, we analyse how people use Regex within SPARQL by examining a large log of queries made available by various RDF store providers. The analysis reveals various use cases whose performance can be made more efficient. There is very little in the literature that adequately tests the performance of Regex within SPARQL. We therefore also propose the first Regex-specific benchmark for SPARQL, named BSBMstr. BSBMstr shows how various Regex features affect the overall performance of SPARQL queries, and its results are reported for seven known RDF stores.

SPARQL benchmarks in general have been a major field attracting much research in the area of the Semantic Web. Nevertheless, many have argued that there are still issues in their design or in their simulation of real-world scenarios. This thesis also proposes a generic SPARQL benchmark, named CBSBench, which introduces a new benchmark design. Unlike other benchmarks, CBSBench measures the performance of clusters of queries rather than fixed queries. The use of clusters also provides a stress test for RDF stores, because of the diversity of queries within each cluster. The CBSBench results are also reported on very different RDF stores.

Finally, the thesis introduces (RegjInd)ex, a Regex index data structure based on a trigram inverted index. This index aims to reduce the result sets that must be scanned to match a Regex filter within SPARQL. The proposal has been implemented on top of two RDF stores and evaluated with two different Regex-specific benchmarks. (RegjInd)ex produces a smaller index than previous work, while still returning results up to an order of magnitude faster than the original implementations.
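The general idea of a trigram inverted index can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the data structure proposed in the thesis: the helper names (`trigrams`, `build_index`, `search`) are hypothetical, and the substring that every match must contain (`required`) is supplied by hand here, whereas a real system would derive it from the pattern.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text (lowercased)."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


def build_index(literals):
    """Map each trigram to the set of literal ids that contain it."""
    index = defaultdict(set)
    for lit_id, text in enumerate(literals):
        for gram in trigrams(text):
            index[gram].add(lit_id)
    return index


def search(literals, index, pattern, required):
    """Run `pattern` only on literals containing every trigram of `required`.

    Assumption: `required` is a substring that any match must contain.
    If it is shorter than three characters, all literals are scanned.
    """
    grams = trigrams(required)
    if not grams:
        candidates = set(range(len(literals)))
    else:
        # Intersect the posting lists to get a small candidate set.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    compiled = re.compile(pattern, re.IGNORECASE)
    return sorted(i for i in candidates if compiled.search(literals[i]))


# Hypothetical toy literals.
LITERALS = ["Laptop Pro 15", "Desktop Tower", "laptop sleeve", "Tablet Mini"]
INDEX = build_index(LITERALS)
hits = search(LITERALS, INDEX, r"laptop\s+pro", "laptop")
print(hits)
```

In this sketch the index narrows four literals down to two candidates before the regular expression is run at all; the final Regex check is still needed because sharing trigrams does not guarantee a match.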

In general, the thesis provides guidelines that developers can follow to investigate similar features within a given DBMS. The investigation mainly relies on real-world usage, analysing how people use these features. From that analysis, developers can construct queries and features alongside our proposed benchmarks to run tests on their chosen subject. The thesis also discusses various ideas and techniques that can be used to enhance the performance of DBMSs.

Text

Final Thesis - Version of Record

Available under License University of Southampton Thesis Licence.
