Representation transfer and data cleaning in multi-views for text simplification
He, Wei, Farrahi, Katayoun, Chen, Bin, Peng, Bohua and Villavicencio, Aline (2023) Representation transfer and data cleaning in multi-views for text simplification. Pattern Recognition Letters, 177, 40-46. (doi:10.1016/j.patrec.2023.11.011).
Abstract
Representation transfer is a widely used technique in natural language processing. We propose multi-view methods for cleaning WikiLarge, the dominant text simplification (TS) dataset, removing errors that impair model training and fine-tuning; the results show that our methods effectively refine the dataset. We also propose transferring pre-trained text representations from a task similar to TS (e.g., text summarization) through a continued-fine-tuning strategy, which improves the performance of pre-trained models on TS while speeding up training and easing convergence. In addition, we propose a new decoding strategy for simple text generation that produces simpler, more comprehensible text with controllable lexical simplicity. Experimental results show that our method performs well on many evaluation metrics.
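The abstract describes the continued-fine-tuning transfer only at a high level. As a rough, hypothetical sketch of the general idea (not the authors' code: the checkpoint name facebook/bart-large-cnn, the column names complex/simple, and all hyperparameters are illustrative assumptions), one could start from a summarization-fine-tuned BART checkpoint and continue fine-tuning it on complex-to-simple sentence pairs with the Hugging Face transformers trainer:

# Hypothetical sketch of continued fine-tuning: begin from a model
# already fine-tuned on summarization, then fine-tune further on
# complex -> simple sentence pairs. Not the authors' implementation.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Public BART checkpoint fine-tuned for summarization (CNN/DailyMail);
# it stands in for whatever summarization representation is transferred.
checkpoint = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy stand-in for the cleaned WikiLarge complex/simple pairs.
pairs = Dataset.from_dict({
    "complex": ["The committee deliberated extensively before reaching a verdict."],
    "simple": ["The committee talked for a long time before deciding."],
})

def preprocess(batch):
    # Tokenize source (complex) sentences and target (simple) labels.
    features = tokenizer(batch["complex"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["simple"], truncation=True, max_length=128)
    features["labels"] = labels["input_ids"]
    return features

train_dataset = pairs.map(preprocess, batched=True, remove_columns=["complex", "simple"])

args = Seq2SeqTrainingArguments(
    output_dir="bart-ts-continued",
    learning_rate=3e-5,            # illustrative hyperparameters only
    num_train_epochs=3,
    per_device_train_batch_size=8,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

Similarly, the controllable-simplicity decoding is only named in the abstract; one generic (again hypothetical, not the paper's method) way to realize lexical control at decode time is to bias generation logits toward a list of simple words with a custom LogitsProcessor, where the bonus alpha acts as the controllability knob:

# Hypothetical decode-time lexical control: add a constant bonus to the
# logits of tokens from a "simple word" list at every generation step.
from transformers import LogitsProcessor, LogitsProcessorList

class SimplicityBias(LogitsProcessor):
    def __init__(self, simple_token_ids, alpha=2.0):
        self.simple_token_ids = simple_token_ids
        self.alpha = alpha  # larger alpha -> stronger pull toward simple words

    def __call__(self, input_ids, scores):
        # scores: (batch, vocab) logits for the next token.
        scores[:, self.simple_token_ids] += self.alpha
        return scores

# Bias a few hand-picked simple words (leading spaces give the
# mid-sentence BPE tokens for BART's tokenizer).
simple_ids = [i for ids in tokenizer(
    [" talk", " help", " use"], add_special_tokens=False
).input_ids for i in ids]

inputs = tokenizer("The committee deliberated extensively.", return_tensors="pt")
out = model.generate(
    **inputs,
    logits_processor=LogitsProcessorList([SimplicityBias(simple_ids)]),
    max_new_tokens=32,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))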
Text: 1-s2.0-S0167865523003215-main (Version of Record)
More information
Accepted/In Press date: 8 November 2023
e-pub ahead of print date: 10 November 2023
Published date: 5 December 2023
Keywords:
Data cleaning, Decoding, Pre-trained language model, Sentence representation, Text simplification
Identifiers
Local EPrints ID: 489975
URI: http://eprints.soton.ac.uk/id/eprint/489975
ISSN: 0167-8655
PURE UUID: 4f302fb0-f017-4d92-a374-9af511e107a0
Catalogue record
Date deposited: 09 May 2024 16:32
Last modified: 10 May 2024 01:51
Contributors
Author: Wei He
Author: Katayoun Farrahi
Author: Bin Chen
Author: Bohua Peng
Author: Aline Villavicencio