The University of Southampton
University of Southampton Institutional Repository

Representation transfer and data cleaning in multi-views for text simplification

He, Wei, Farrahi, Katayoun, Chen, Bin, Peng, Bohua and Villavicencio, Aline (2023) Representation transfer and data cleaning in multi-views for text simplification. Pattern Recognition Letters, 177, 40-46. (doi:10.1016/j.patrec.2023.11.011).

Record type: Article

Abstract

Representation transfer is a widely used technique in natural language processing. We propose methods for cleaning WikiLarge, the dominant dataset for text simplification (TS), from multiple views to remove errors that impair model training and fine-tuning. The results show that our method can effectively refine the dataset. We also propose transferring pre-trained text representations from a similar task (e.g., text summarization) to text simplification through a continued fine-tuning strategy, which improves the performance of pre-trained models on TS. This approach speeds up training and makes model convergence easier. In addition, we propose a new decoding strategy for simple text generation that produces simpler and more comprehensible text with controllable lexical simplicity. The experimental results show that our method achieves good performance on many evaluation metrics.

Text
1-s2.0-S0167865523003215-main - Version of Record
Available under License Creative Commons Attribution.

More information

Accepted/In Press date: 8 November 2023
e-pub ahead of print date: 10 November 2023
Published date: 5 December 2023
Keywords: Data cleaning, Decoding, Pre-trained language model, Sentence representation, Text simplification

Identifiers

Local EPrints ID: 489975
URI: http://eprints.soton.ac.uk/id/eprint/489975
ISSN: 0167-8655
PURE UUID: 4f302fb0-f017-4d92-a374-9af511e107a0
ORCID for Katayoun Farrahi: orcid.org/0000-0001-6775-127X

Catalogue record

Date deposited: 09 May 2024 16:32
Last modified: 10 May 2024 01:51

Contributors

Author: Wei He
Author: Katayoun Farrahi
Author: Bin Chen
Author: Bohua Peng
Author: Aline Villavicencio
