Tensor-based graph modularity for text data clustering
Boutalbi, Rafika, Ait-Saada, Mira, Iurshina, Anastasiia, Staab, Steffen and Nadif, Mohamed
(2022)
Tensor-based graph modularity for text data clustering.
45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain.
11 - 15 Jul 2022.
(In Press)
Record type:
Conference or Workshop Item
(Paper)
Abstract
Graphs are used in many applications to represent similarities between instances. Text data can be represented by different features, such as bag-of-words, static embeddings (Word2vec, GloVe, etc.), and contextual embeddings (BERT, RoBERTa, etc.), and each representation yields its own similarity graph. The proposal posits that incorporating the local invariance within each graph and the consistency across the different graphs leads to a consensus clustering that improves document clustering. The problem is complex and is made harder by the sparsity and noise in each graph. To address this, we rely on the modularity metric, which effectively evaluates graph clustering under such conditions. We present a novel approach for text clustering based on both a sparse tensor representation and graph modularity, which clusters texts (nodes) while capturing information from the different graphs. The method iteratively maximizes a Tensor-based Graph Modularity criterion. Extensive experiments on benchmark text clustering datasets show that the proposed algorithm, referred to as Tensor Graph Modularity (TGM), outperforms baseline methods on the clustering task. The source code is available at https://github.com/TGMclustering/TGMclustering.
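As a rough illustration of the idea in the abstract, the sketch below scores a candidate clustering with the standard Newman modularity on each similarity graph and sums the scores over a stack of graphs (one slice per text representation). This is not the authors' TGM algorithm, whose criterion and optimization are described in the paper and in the linked repository; the function names `modularity` and `tensor_modularity`, the plain summation over slices, and the toy graphs are all assumptions made for this example.

```python
import numpy as np


def modularity(adj, labels):
    """Newman modularity of a partition on one undirected, weighted graph."""
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    two_m = adj.sum()  # total edge weight, each edge counted in both directions
    if two_m == 0:
        return 0.0
    degrees = adj.sum(axis=1)
    # Expected edge weight between each pair of nodes under the null model
    expected = np.outer(degrees, degrees) / two_m
    # Only pairs of nodes assigned to the same cluster contribute
    same_cluster = labels[:, None] == labels[None, :]
    return float(((adj - expected) * same_cluster).sum() / two_m)


def tensor_modularity(tensor, labels):
    """Sum of per-slice modularities for graphs stacked as (views, n, n).

    Assumption: slices are combined by a plain sum, chosen for simplicity.
    """
    return sum(modularity(graph, labels) for graph in tensor)


# Hypothetical toy data: two similarity graphs over the same four documents.
view_a = np.array([[0, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
view_b = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
tensor = np.stack([view_a, view_b])

agreeing = [0, 0, 1, 1]   # groups documents that are linked in both views
scrambled = [0, 1, 0, 1]  # separates every linked pair

score_agreeing = tensor_modularity(tensor, agreeing)
score_scrambled = tensor_modularity(tensor, scrambled)
print(score_agreeing, score_scrambled)
```

A clustering that agrees with the structure of both graphs receives a higher combined score than one that cuts across it, which is the quantity an iterative procedure would try to increase.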
Text
SIGIR_2022__Tensor_based_Graph_Modularity_for_Multi_view_Text_Data_Clustering - camera-ready
- Accepted Manuscript
More information
Accepted/In Press date: 2022
Venue - Dates:
45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 2022-07-11 - 2022-07-15
Identifiers
Local EPrints ID: 458195
URI: http://eprints.soton.ac.uk/id/eprint/458195
PURE UUID: d19d554e-923d-443d-b2ab-e83c05c7b19d
Catalogue record
Date deposited: 30 Jun 2022 17:08
Last modified: 17 Mar 2024 03:38
Contributors
Author:
Rafika Boutalbi
Author:
Mira Ait-Saada
Author:
Anastasiia Iurshina
Author:
Steffen Staab
Author:
Mohamed Nadif