The University of Southampton
University of Southampton Institutional Repository

ParaCrawl: web-scale acquisition of parallel corpora


Bañón, Marta, Chen, Pinzhen, Haddow, Barry, Heafield, Kenneth, Hoang, Hieu, Esplà-Gomis, Miquel, Forcada, Mikel, Kamran, Amir, Kirefu, Faheem, Koehn, Philipp, Ortiz-Rojas, Sergio, Pla, Leopoldo, Ramírez-Sánchez, Gema, Sarrías, Elsa, Strelec, Marek, Thompson, Brian, Waites, William, Wiggins, Dion and Zaragoza, Jaume (2020) ParaCrawl: web-scale acquisition of parallel corpora. In ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference. 4555–4567. (doi:10.18653/v1/2020.acl-main.417).

Record type: Conference or Workshop Item (Paper)

Abstract

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.

This record has no associated files available for download.

More information

Published date: 10 July 2020

Identifiers

Local EPrints ID: 500047
URI: http://eprints.soton.ac.uk/id/eprint/500047
PURE UUID: 88484d68-d687-43c9-ba80-393e8af37937
ORCID for William Waites: orcid.org/0000-0002-7759-6805

Catalogue record

Date deposited: 14 Apr 2025 16:35
Last modified: 15 Apr 2025 02:39

Contributors

Author: Marta Bañón
Author: Pinzhen Chen
Author: Barry Haddow
Author: Kenneth Heafield
Author: Hieu Hoang
Author: Miquel Esplà-Gomis
Author: Mikel Forcada
Author: Amir Kamran
Author: Faheem Kirefu
Author: Philipp Koehn
Author: Sergio Ortiz-Rojas
Author: Leopoldo Pla
Author: Gema Ramírez-Sánchez
Author: Elsa Sarrías
Author: Marek Strelec
Author: Brian Thompson
Author: William Waites
Author: Dion Wiggins
Author: Jaume Zaragoza
