University of Southampton Institutional Repository

Tensor-based graph modularity for text data clustering

Boutalbi, Rafika
Ait-Saada, Mira
Iurshina, Anastasiia
Staab, Steffen
Nadif, Mohamed

Boutalbi, Rafika, Ait-Saada, Mira, Iurshina, Anastasiia, Staab, Steffen and Nadif, Mohamed (2022) Tensor-based graph modularity for text data clustering. 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain. 11 - 15 Jul 2022. (In Press)

Record type: Conference or Workshop Item (Paper)

Abstract

Graphs are used in several applications to represent similarities between instances. Text data can be represented by different features, such as bag-of-words, static embeddings (Word2vec, GloVe, etc.), and contextual embeddings (BERT, RoBERTa, etc.), and each representation yields its own similarity graph. We posit that incorporating both the local invariance within each graph and the consistency across graphs leads to a consensus clustering that improves document clustering. The problem is complex, and it is made harder by the sparsity and noise present in each graph. To address this, we rely on the modularity metric, which effectively evaluates graph clustering under such conditions. We therefore present a novel approach to text clustering based on both a sparse tensor representation and graph modularity. It clusters texts (nodes) while capturing information from the different graphs, by iteratively maximizing a Tensor-based Graph Modularity criterion. Extensive experiments on benchmark text clustering datasets show that the proposed algorithm, referred to as Tensor Graph Modularity (TGM), outperforms other baseline methods on the clustering task. The source code is available at https://github.com/TGMclustering/TGMclustering.
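
To make the modularity idea in the abstract concrete, the following is a minimal sketch and not the authors' TGM algorithm. It assumes the standard Newman-Girvan modularity for a single undirected graph and, as a simplifying assumption, scores a partition over several graphs by summing the per-graph values. The function names and the toy graph are hypothetical; the actual criterion and its optimization are described in the paper and in the linked source code.

```python
import numpy as np


def modularity(adjacency, labels):
    """Newman-Girvan modularity of one undirected (weighted) graph."""
    a = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    two_m = a.sum()  # total edge weight, each edge counted in both directions
    if two_m == 0:
        return 0.0
    degrees = a.sum(axis=1)
    # same[i, j] is True when nodes i and j share a cluster
    same = labels[:, None] == labels[None, :]
    # expected weight between i and j under the degree-preserving null model
    expected = np.outer(degrees, degrees) / two_m
    return float(((a - expected) * same).sum() / two_m)


def multi_graph_modularity(adjacency_stack, labels):
    """Assumed illustrative criterion: sum of modularities over all views.

    adjacency_stack has shape (views, n, n), one similarity graph per
    text representation.
    """
    return sum(modularity(a, labels) for a in adjacency_stack)


# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

good = [0, 0, 0, 1, 1, 1]  # one cluster per triangle
bad = [0, 1, 0, 1, 0, 1]   # clusters that cut across the triangles
views = np.stack([A, A])   # two identical views, for illustration only

print(modularity(A, good), modularity(A, bad))
print(multi_graph_modularity(views, good))
```

In this sketch the partition that follows the two triangles scores higher than the one that cuts across them, which is the behaviour a modularity-maximizing clustering exploits.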

Text
SIGIR_2022__Tensor_based_Graph_Modularity_for_Multi_view_Text_Data_Clustering - camera-ready - Accepted Manuscript

More information

Accepted/In Press date: 2022
Venue - Dates: 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 2022-07-11 - 2022-07-15

Identifiers

Local EPrints ID: 458195
URI: http://eprints.soton.ac.uk/id/eprint/458195
PURE UUID: d19d554e-923d-443d-b2ab-e83c05c7b19d
ORCID for Steffen Staab: orcid.org/0000-0002-0780-4154

Catalogue record

Date deposited: 30 Jun 2022 17:08
Last modified: 24 Nov 2022 02:47

Contributors

Author: Rafika Boutalbi
Author: Mira Ait-Saada
Author: Anastasiia Iurshina
Author: Steffen Staab
Author: Mohamed Nadif
