A schema exploration approach for document-oriented data using unsupervised techniques

For more than 40 years, the relational model was the dominant approach to storing and managing data. However, its complex and strict data model has started to lose ground, especially where the data requirements of applications change frequently and therefore call for a flexible data model. Contemporary applications favour less strict data models that can be described as semi-structured or unstructured, and these kinds of data are gaining popularity among database developers. For instance, the amount of document-oriented data available on the Web keeps rising, and a large portion of it comes from Web APIs. Although document-oriented data is widely used, it cannot be easily consumed or analysed if it lacks a description (i.e., metadata) or schemas that explain its internal structure. Schemas make datasets more understandable and easier to query and analyse.

Based on our literature review, we found that current initiatives and tools, particularly those for JSON documents, do not provide comprehensive summaries of the documents' internal structure. Our research builds on current work on JSON schema inference by addressing specific research gaps related to extracting accurate and explicit summaries of the internal structure of JSON documents. It aims to provide structural summaries that help in understanding JSON documents by explicitly extracting schemas and analysing their attributes. Our approach first infers all attributes from a JSON dataset, then applies a clustering algorithm to the documents to identify unique schemas; finally, based on the schemas resulting from the clustering process, we apply statistical analysis to the attributes to generate useful summaries by detecting common and schema-specific attributes. The approach is validated and evaluated using real-world and synthetic data collected from the Web in different domains.
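
As a rough illustration of the three steps described above (attribute inference, clustering, and attribute analysis), the following minimal Python sketch may help. It is not the thesis's implementation: the abstract does not name the clustering algorithm or the statistics used, so the sketch assumes a simple greedy clustering on Jaccard similarity of attribute sets and plain set operations for the summary. All function names, the dotted-path convention, and the 0.5 threshold are hypothetical choices made for the example.

```python
def attribute_paths(doc, prefix=""):
    """Step 1: collect the dotted attribute paths of one JSON-like dict."""
    paths = set()
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        paths.add(path)
        if isinstance(value, dict):  # descend into nested objects only
            paths |= attribute_paths(value, path)
    return paths


def jaccard(a, b):
    """Jaccard similarity of two attribute sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_by_attributes(docs, threshold=0.5):
    """Step 2 (assumed technique): greedy single-pass clustering.

    A document joins the first cluster whose representative attribute set
    (that of the cluster's first document) is at least `threshold` similar;
    otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative_attrs, member_attr_sets)
    for doc in docs:
        attrs = attribute_paths(doc)
        for representative, members in clusters:
            if jaccard(attrs, representative) >= threshold:
                members.append(attrs)
                break
        else:
            clusters.append((attrs, [attrs]))
    return [members for _, members in clusters]


def summarise(clusters):
    """Step 3 (simplified): common and schema-specific attributes."""
    per_cluster = [set().union(*members) for members in clusters]
    common = set.intersection(*per_cluster) if per_cluster else set()
    specific = []
    for i, attrs in enumerate(per_cluster):
        others = set().union(
            *(a for j, a in enumerate(per_cluster) if j != i)
        )
        specific.append(attrs - others)
    return common, specific


if __name__ == "__main__":
    # Made-up documents, for illustration only.
    docs = [
        {"id": 1, "name": "A", "price": 3.5},
        {"id": 2, "name": "B", "price": 4.0, "tags": ["x"]},
        {"id": 3, "user": {"name": "u", "email": "e"}},
        {"id": 4, "user": {"name": "v"}},
    ]
    common, specific = summarise(cluster_by_attributes(docs))
    print("common:", sorted(common))
    for i, attrs in enumerate(specific):
        print(f"schema {i} only:", sorted(attrs))
```

The greedy pass is order-dependent and was chosen only because it needs no external libraries; a real analysis would likely use an established clustering algorithm and richer statistics, such as how often each attribute occurs within a schema.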

University of Southampton

Bawakid, Fahad

October 2019

Hall, Wendy

Bawakid, Fahad (2019) A schema exploration approach for document-oriented data using unsupervised techniques. University of Southampton, Doctoral Thesis, 153pp.

Record type: Thesis (Doctoral)