University of Southampton Institutional Repository

Text simplification using transformer and BERT


Alissa, Sarah and Wald, Michael (2023) Text simplification using transformer and BERT. Computers, Materials & Continua, 75 (2), 3479–3495. (doi:10.32604/cmc.2023.033647).

Record type: Article

Abstract

Reading and writing are the main ways of interacting with web content. Text simplification tools help people with cognitive impairments, new language learners, and children, who may find complex web content difficult to understand. Text simplification is the process of converting complex text into text that is more readable and understandable. Recent approaches to text simplification adopt the machine translation paradigm, learning simplification rules from a parallel corpus of complex and simple sentences. In this paper, we propose two models based on the transformer, an encoder-decoder architecture that achieves state-of-the-art (SOTA) results in machine translation. The training process for our models includes three steps: preprocessing the data with a subword tokenizer, training the model and optimizing it with the Adam optimizer, and using the trained model to decode the output. The first model uses the transformer alone; the second integrates Bidirectional Encoder Representations from Transformers (BERT) as the encoder to improve training time and results. The performance of the transformer-only model, evaluated with the Bilingual Evaluation Understudy (BLEU) score, reached 53.78 on the WikiSmall dataset. In the experiment on the second model, which integrates BERT, the validation loss decreased much faster than for the model without BERT. However, the BLEU score was lower (44.54), which may be due to the size of the dataset: the model overfitted and was unable to generalize well. Therefore, future work on the second model could involve experimenting with a larger dataset such as WikiLarge. In addition, further analysis of the models' results and the dataset used was carried out with different evaluation metrics to understand their performance.
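
To make the three training steps described in the abstract concrete, below is a minimal, illustrative sketch in PyTorch of such a pipeline: subword tokenization with SentencePiece, a transformer encoder-decoder trained with cross-entropy loss and the Adam optimizer, and a note on decoding and BLEU scoring. This is not the authors' code; the file name, vocabulary size, sentences, and hyperparameters are placeholder assumptions.

import torch
import torch.nn as nn
import sentencepiece as spm

# Step 1 - preprocessing: train a subword (BPE) tokenizer on the parallel corpus.
# "wikismall_sentences.txt" is a placeholder file name, not from the paper.
spm.SentencePieceTrainer.train(
    input="wikismall_sentences.txt", model_prefix="simplify_bpe",
    vocab_size=8000, model_type="bpe")
sp = spm.SentencePieceProcessor(model_file="simplify_bpe.model")

class SimplifierTransformer(nn.Module):
    # Encoder-decoder transformer mapping complex sentences to simplified ones.
    def __init__(self, vocab_size, d_model=256, nhead=4, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead, num_encoder_layers=layers,
            num_decoder_layers=layers, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Causal mask so each target position attends only to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids),
                                  tgt_mask=tgt_mask)
        return self.out(hidden)

# Step 2 - training: cross-entropy loss optimized with Adam (illustrative values).
model = SimplifierTransformer(vocab_size=sp.get_piece_size())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on a single (complex, simple) sentence pair.
complex_ids = torch.tensor([sp.encode(
    "The committee deliberated at considerable length.", add_bos=True, add_eos=True)])
simple_ids = torch.tensor([sp.encode(
    "The committee talked for a long time.", add_bos=True, add_eos=True)])

logits = model(complex_ids, simple_ids[:, :-1])  # teacher forcing
loss = loss_fn(logits.reshape(-1, logits.size(-1)), simple_ids[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Step 3 - decoding: at inference time, feed the complex sentence to the encoder,
# generate the simplified sentence token by token (e.g. greedy or beam search), and
# score the outputs against reference simplifications with BLEU (e.g. via sacrebleu).

The paper's second model differs conceptually in that the randomly initialized encoder above would be replaced by a pretrained BERT encoder, while the decoder and training procedure remain the same in spirit.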

Text
TSP_CMC_33647 - Version of Record
Available under License Creative Commons Attribution.
Download (325kB)

More information

Accepted/In Press date: 6 February 2023
e-pub ahead of print date: 31 March 2023
Published date: 2023
Additional Information: Publisher Copyright: © 2023 Tech Science Press. All rights reserved.
Keywords: text simplification; neural machine translation; transformer

Identifiers

Local EPrints ID: 477287
URI: http://eprints.soton.ac.uk/id/eprint/477287
PURE UUID: 8fd08e7a-1102-4538-b5ae-ad332885ec9a

Catalogue record

Date deposited: 02 Jun 2023 16:35
Last modified: 17 Mar 2024 01:35


Contributors

Author: Sarah Alissa
Author: Michael Wald

