University of Southampton Institutional Repository

Analyzing the influence of hyper-parameters and regularizers of topic modeling in terms of Renyi entropy

Koltsov, Sergei
022812a6-4063-4263-a9f3-a5e1417cd91f
Ignatenko, Vera
0a791974-3a01-43bb-827b-ec30deead858
Boukhers, Zeyd
0768f27b-2434-442a-bf16-00264e90b3cd
Staab, Steffen
bf48d51b-bd11-4d58-8e1c-4e6e03b30c49

Koltsov, Sergei, Ignatenko, Vera, Boukhers, Zeyd and Staab, Steffen (2020) Analyzing the influence of hyper-parameters and regularizers of topic modeling in terms of Renyi entropy. Entropy, 22 (4), 394. (doi:10.3390/e22040394).

Record type: Article

Abstract

Topic modeling is a popular technique for clustering large collections of text documents. A variety of different types of regularization are implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, this approach is inspired by concepts from statistical physics, where an inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. By testing our approach on four models—Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA)—we first show that the minimum of Renyi entropy coincides with the “true” number of topics, as determined in two labelled collections. At the same time, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach for topic number optimization, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the minimum of entropy away from the optimal topic number, an effect not observed for hyper-parameters in LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models, which calls for further research.

This record has no associated files available for download.

More information

Accepted/In Press date: 25 March 2020
e-pub ahead of print date: 30 March 2020
Published date: 1 April 2020
Additional Information: Funding: Sergei Koltcov and Vera Ignatenko were supported by the Basic Research Program at the National Research University Higher School of Economics in 2019. Zeyd Boukhers and Steffen Staab were previously supported by the German Research Foundation (DFG) through the project grant ‘Extraction of Citations from PDF Documents (EXCITE)’ under grant number STA 572/14-1. Steffen Staab is now supported by the German Research Foundation (DFG) through the project grant ‘Open Argument Mining’ (grant number STA 572/18-1). Publisher Copyright: © 2020 by the authors. Licensee MDPI, Basel, Switzerland.
Keywords: Regularization, Renyi entropy, Topic modeling

Identifiers

Local EPrints ID: 438982
URI: http://eprints.soton.ac.uk/id/eprint/438982
PURE UUID: e8b5c7a9-b3d9-44e1-8f3f-fa22ae3ec1a6
ORCID for Steffen Staab: orcid.org/0000-0002-0780-4154

Catalogue record

Date deposited: 31 Mar 2020 16:30
Last modified: 06 Jun 2024 01:54

Contributors

Author: Sergei Koltsov
Author: Vera Ignatenko
Author: Zeyd Boukhers
Author: Steffen Staab


Contact ePrints Soton: eprints@soton.ac.uk

ePrints Soton supports OAI 2.0 with a base URL of http://eprints.soton.ac.uk/cgi/oai2

This repository has been built using EPrints software, developed at the University of Southampton, but available to everyone to use.
