The University of Southampton
University of Southampton Institutional Repository

A schema exploration approach for document-oriented data using unsupervised techniques

For more than 40 years, relational data was the dominant force in the world of storing and managing data. However, the complex and strict relational data model has started to lose ground, especially where the data requirements of applications change frequently and therefore call for a flexible data model. Contemporary applications favour less strict data models that can be described as semi-structured or unstructured. These kinds of data are gaining popularity among database developers. For instance, the amount of document-oriented data available on the Web keeps rising, and a large portion of it comes from Web APIs. Although document-oriented data is widely used, it cannot be easily consumed or analysed if it lacks a description (i.e., metadata) or a schema that explains its internal structure. Schemas make datasets more understandable and easier to query and analyse.

Our literature review found that current initiatives and tools, particularly those for JSON documents, do not provide comprehensive summaries that explore the internal structure of the documents. Our research builds on current work on JSON schema inference by addressing identified research gaps related to extracting accurate and explicit summaries of the internal structure of JSON documents. This research aims to provide structural summaries that help in understanding JSON documents by explicitly extracting schemas and analysing their attributes. Our approach first infers all attributes from a JSON dataset, then applies a clustering algorithm to the documents to identify unique schemas. Finally, based on the schemas resulting from the clustering process, we apply statistical analysis to the attributes to generate useful summaries by detecting common and schema-specific attributes. The approach is demonstrated and evaluated on real-world and synthetic data collected from the Web in different domains.
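The three steps named in the abstract (attribute inference, clustering, attribute statistics) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not name the clustering algorithm, so the greedy Jaccard-similarity grouping, the `threshold` value, the dotted-path notation, and all function names below are assumptions chosen only to make the idea concrete.

```python
from collections import Counter


def attribute_paths(value, prefix=""):
    """Step 1: collect the dotted attribute paths of one parsed JSON value."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(child, path)
    elif isinstance(value, list):
        # Array items share one path, marked with "[]" (an assumed notation).
        for item in value:
            paths |= attribute_paths(item, prefix + "[]")
    return paths


def jaccard(a, b):
    """Similarity of two attribute sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_documents(attribute_sets, threshold=0.6):
    """Step 2 (assumed stand-in): greedily group documents whose attribute
    sets are at least `threshold` similar to a cluster's merged schema."""
    clusters = []
    for index, attrs in enumerate(attribute_sets):
        best, best_score = None, threshold
        for cluster in clusters:
            score = jaccard(attrs, cluster["schema"])
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append({"schema": set(attrs), "members": [index]})
        else:
            best["schema"] |= attrs
            best["members"].append(index)
    return clusters


def summarise(attribute_sets, clusters):
    """Step 3: attributes present in every document, and attributes that
    occur in exactly one cluster's schema."""
    total = len(attribute_sets)
    counts = Counter(a for attrs in attribute_sets for a in attrs)
    common = {a for a, n in counts.items() if n == total}
    specific = []
    for cluster in clusters:
        others = set().union(
            *(c["schema"] for c in clusters if c is not cluster)
        )
        specific.append(cluster["schema"] - others)
    return common, specific


# Hypothetical documents mixing two shapes (books and products).
documents = [
    {"id": 1, "title": "A", "author": {"name": "x"}},
    {"id": 2, "title": "B", "author": {"name": "y"}, "year": 2019},
    {"id": 3, "sku": "k", "price": 1.0},
    {"id": 4, "sku": "m", "price": 2.0, "stock": 3},
]
attribute_sets = [attribute_paths(doc) for doc in documents]
clusters = cluster_documents(attribute_sets)
common, specific = summarise(attribute_sets, clusters)
print(len(clusters), sorted(common), [sorted(s) for s in specific])
```

In this sketch the order of the input documents influences the greedy grouping, and the threshold has to be tuned per dataset; a real clustering algorithm would avoid both limitations.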
University of Southampton
Bawakid, Fahad
7d1d34fd-4ec3-41f9-ac24-1f38c5bf1e79
Hall, Wendy
11f7f8db-854c-4481-b1ae-721a51d8790c

Bawakid, Fahad (2019) A schema exploration approach for document-oriented data using unsupervised techniques. University of Southampton, Doctoral Thesis, 153pp.

Record type: Thesis (Doctoral)

Text
Final thesis - Version of Record
Available under License University of Southampton Thesis Licence.

More information

Published date: October 2019

Identifiers

Local EPrints ID: 435796
URI: http://eprints.soton.ac.uk/id/eprint/435796
PURE UUID: ca2ed211-cefc-460b-8302-a7599d874e06
ORCID for Wendy Hall: orcid.org/0000-0003-4327-7811

Catalogue record

Date deposited: 20 Nov 2019 17:30
Last modified: 21 Nov 2019 01:39
