University of Southampton Institutional Repository

Evaluating text classification: a benchmark study


Reusens, Manon, Stevens, Alexander, Tonglet, Jonathan, De Smedt, Johannes, Verbeke, Wouter, Broucke, Seppe vanden and Baesens, Bart (2024) Evaluating text classification: a benchmark study. Expert Systems with Applications. (In Press)

Record type: Article

Abstract

This paper presents an impartial and extensive benchmark for text classification involving five different text classification tasks, 20 datasets, 11 different model architectures, and 42,800 algorithm runs. The five text classification tasks are fake news classification, topic detection, emotion detection, polarity detection, and sarcasm detection. While research in practice, especially in Natural Language Processing (NLP), tends to focus on the most sophisticated models, we hypothesize that this is not always necessary. Our main objective is therefore to investigate whether the largest state-of-the-art (SOTA) models are always preferred, or in which cases simple methods can compete with complex models, i.e., for which dataset specifications and classification tasks. We assess the performance of methods of varying complexity, ranging from simple statistical and machine learning methods to pretrained transformers such as the robustly optimized BERT (Bidirectional Encoder Representations from Transformers) pretraining approach (RoBERTa). Such a comprehensive benchmark is lacking in the existing literature, where research mainly compares similar types of methods. Furthermore, with growing awareness of the ecological impact of extensive computational resource usage, this comparison is both critical and timely. We find that, overall, bidirectional long short-term memory (LSTM) networks rank as the best-performing method, albeit not statistically significantly better than logistic regression and RoBERTa. Overall, we cannot conclude that simple methods perform worse; this depends mainly on the classification task. Concretely, we find that for fake news classification and topic detection, simple techniques are the best-ranked models, and it is consequently not necessary to train complicated neural network architectures for these tasks. Moreover, we find a negative correlation between F1 performance and model complexity for the smallest datasets (fewer than 10,000 instances). Finally, the different models’ results are analyzed in depth to explain the model decisions, which is an increasingly common requirement in the field of text classification.
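
To make the comparison concrete, the following is a minimal sketch (not the authors' benchmark code) of the kind of simple baseline the study ranks against neural models such as BiLSTMs and RoBERTa: a TF-IDF bag-of-words representation with logistic regression, evaluated with macro-F1. The dataset file and its "text"/"label" columns are hypothetical placeholders.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical dataset with "text" and "label" columns, e.g. a fake news corpus.
df = pd.read_csv("fake_news.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# Simple baseline: unigram/bigram TF-IDF features fed into an
# L2-regularized logistic regression classifier.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
baseline.fit(X_train, y_train)

# The benchmark compares methods on F1; macro-F1 weights all classes equally.
print("macro-F1:", f1_score(y_test, baseline.predict(X_test), average="macro"))

Baselines of this kind are cheap to train, which connects to the paper's ecological argument: if such a model matches a transformer's F1 on a given task, the heavier architecture may not be needed.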

Text
Paper 1 Evaluating Text Classification A Benchmark Study - Accepted Manuscript
Restricted to Repository staff only until 21 May 2026.

More information

Accepted/In Press date: 21 May 2024

Identifiers

Local EPrints ID: 490601
URI: http://eprints.soton.ac.uk/id/eprint/490601
ISSN: 0957-4174
PURE UUID: d1320468-90d8-428f-a846-6756ffdef89f
ORCID for Bart Baesens: orcid.org/0000-0002-5831-5668

Catalogue record

Date deposited: 31 May 2024 16:32
Last modified: 01 Jun 2024 01:38

Contributors

Author: Manon Reusens
Author: Alexander Stevens
Author: Jonathan Tonglet
Author: Johannes De Smedt
Author: Wouter Verbeke
Author: Seppe vanden Broucke
Author: Bart Baesens (orcid.org/0000-0002-5831-5668)



