A schema exploration approach for document-oriented data using unsupervised techniques
For more than 40 years, the relational model dominated data storage and management. However, its complex and strict data model has lost ground, especially where the data requirements of applications change frequently and call for a more flexible model. Contemporary applications favour less strict data models that can be described as semi-structured or unstructured, and these kinds of data are gaining popularity among database developers. For instance, the amount of document-oriented data available on the Web continues to rise, and a large portion of it comes from Web APIs. Although document-oriented data is widely used, it cannot be easily consumed or analysed if it lacks a description (i.e., metadata) or schemas that explain its internal structure. Schemas make datasets more understandable and easier to query and analyse.
Based on our literature review, we found that current initiatives and tools, in particular those available for JSON documents, do not provide comprehensive summaries that explore the internal structure of the documents. Our research builds on current work on JSON schema inference by addressing the identified research gaps related to extracting accurate and explicit summaries of the internal structure of JSON documents. It aims to provide structural summaries that help in understanding JSON documents by explicitly extracting schemas and analysing their attributes. Our approach first infers all attributes from a JSON dataset, then applies a clustering algorithm to the documents to identify unique schemas; finally, based on the schemas resulting from the clustering process, it applies statistical analysis to the attributes to generate useful summaries by detecting common and schema-specific attributes. The approach is validated and evaluated on real-world and synthetic data collected from the Web in different domains.
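The three steps described in the abstract (attribute inference, clustering, attribute statistics) can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not name the clustering algorithm, so a greedy leader clustering on Jaccard similarity of attribute-path sets is used here as a stand-in, and the function names, the `threshold` value, and the sample documents are all hypothetical.

```python
from collections import Counter


def attribute_paths(doc, prefix=""):
    """Return the set of dotted attribute paths found in one JSON value."""
    paths = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= attribute_paths(value, path)
    elif isinstance(doc, list):
        for item in doc:
            # Array elements share the parent's path; indices are ignored.
            paths |= attribute_paths(item, prefix)
    return paths


def jaccard(a, b):
    """Jaccard similarity of two sets (two empty sets count as identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_documents(docs, threshold=0.6):
    """Greedy leader clustering (an assumed stand-in, not the thesis's algorithm).

    Each document joins the first cluster whose leader is similar enough,
    otherwise it starts a new cluster and becomes its leader.
    """
    clusters = []
    for doc in docs:
        paths = attribute_paths(doc)
        for cluster in clusters:
            if jaccard(paths, cluster["leader"]) >= threshold:
                cluster["members"].append(paths)
                break
        else:
            clusters.append({"leader": paths, "members": [paths]})
    return clusters


def summarise(clusters):
    """Split attributes into those seen in every cluster and those seen in exactly one."""
    per_cluster = [set().union(*c["members"]) for c in clusters]
    counts = Counter(p for attrs in per_cluster for p in attrs)
    common = {p for p, n in counts.items() if n == len(clusters)}
    specific = [{p for p in attrs if counts[p] == 1} for attrs in per_cluster]
    return common, specific


# Hypothetical sample documents with two visibly different shapes.
docs = [
    {"id": 1, "title": "A", "author": {"name": "X"}},
    {"id": 2, "title": "B", "author": {"name": "Y"}, "year": 2019},
    {"id": 3, "lat": 50.9, "lon": -1.4},
    {"id": 4, "lat": 51.5, "lon": -0.1, "tags": ["city"]},
]
clusters = cluster_documents(docs)
common, specific = summarise(clusters)
print("common:", sorted(common))
for i, attrs in enumerate(specific):
    print(f"schema {i} only:", sorted(attrs))
```

In this sketch a "schema" is simply the union of attribute paths in a cluster; a fuller treatment would presumably also record value types and how often each attribute occurs within its cluster.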
University of Southampton
Bawakid, Fahad
7d1d34fd-4ec3-41f9-ac24-1f38c5bf1e79
October 2019
Hall, Wendy
11f7f8db-854c-4481-b1ae-721a51d8790c
Bawakid, Fahad (2019) A schema exploration approach for document-oriented data using unsupervised techniques. University of Southampton, Doctoral Thesis, 153pp.
Record type: Thesis (Doctoral)
Text: Final thesis (Version of Record)
More information
Published date: October 2019
Identifiers
Local EPrints ID: 435796
URI: http://eprints.soton.ac.uk/id/eprint/435796
PURE UUID: ca2ed211-cefc-460b-8302-a7599d874e06
Catalogue record
Date deposited: 20 Nov 2019 17:30
Last modified: 17 Mar 2024 02:32
Contributors
Author: Fahad Bawakid