ParaCrawl: web-scale acquisition of parallel corpora
We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.
Pages: 4555–4567
Bañón, Marta
e22a4d1b-f5d8-4e26-88b8-f039b7ed2800
Chen, Pinzhen
df336b2a-d200-449d-b4bb-b7043d594da6
Haddow, Barry
54ca2e36-a0f5-4527-a90f-4e619d36fa13
Heafield, Kenneth
de39c1c6-7538-44d1-abcc-3dee42c972a9
Hoang, Hieu
11489bb1-5914-4af1-a718-55c361cb0dda
Esplà-Gomis, Miquel
5445c620-cfa5-40a3-874d-daebc3ffffad
Forcada, Mikel
bbe3a760-57b4-490e-bd4a-7a96a13c0e4a
Kamran, Amir
f353b842-6ece-47fb-8353-8db9b4a4a7c4
Kirefu, Faheem
5d4cf6c9-34bb-4f43-9174-a624c366bca3
Koehn, Philipp
b8ce2cb9-0b27-49f7-8efa-247202400f6b
Ortiz-Rojas, Sergio
17447aea-13f5-4676-a67d-a747bfd3122e
Pla, Leopoldo
fa0998f5-558d-4ba5-a351-a0281aa41d20
Ramírez-Sánchez, Gema
9c51ae55-3181-4e40-b463-7843a6b1dcc8
Sarrías, Elsa
39c7bfa1-d1b4-475f-ba6a-072729b84d05
Strelec, Marek
3de60082-81b5-4d41-860a-52f2144be9d6
Thompson, Brian
ba5d5cab-bf2f-419a-981f-34f345aa880f
Waites, William
a069e5ff-f440-4b89-ae81-3b58c2ae2afd
Wiggins, Dion
13d9a946-336e-460c-bdd3-71b394026045
Zaragoza, Jaume
33d1118a-573e-4d7f-aef2-81f5a778a367
10 July 2020
Bañón, Marta, Chen, Pinzhen, Haddow, Barry, Heafield, Kenneth, Hoang, Hieu, Esplà-Gomis, Miquel, Forcada, Mikel, Kamran, Amir, Kirefu, Faheem, Koehn, Philipp, Ortiz-Rojas, Sergio, Pla, Leopoldo, Ramírez-Sánchez, Gema, Sarrías, Elsa, Strelec, Marek, Thompson, Brian, Waites, William, Wiggins, Dion and Zaragoza, Jaume (2020) ParaCrawl: web-scale acquisition of parallel corpora. In ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference. pp. 4555–4567. (doi:10.18653/v1/2020.acl-main.417).
Record type: Conference or Workshop Item (Paper)
This record has no associated files available for download.
More information
Published date: 10 July 2020
Identifiers
Local EPrints ID: 500047
URI: http://eprints.soton.ac.uk/id/eprint/500047
PURE UUID: 88484d68-d687-43c9-ba80-393e8af37937
Catalogue record
Date deposited: 14 Apr 2025 16:35
Last modified: 15 Apr 2025 02:39
Contributors
Author: Marta Bañón
Author: Pinzhen Chen
Author: Barry Haddow
Author: Kenneth Heafield
Author: Hieu Hoang
Author: Miquel Esplà-Gomis
Author: Mikel Forcada
Author: Amir Kamran
Author: Faheem Kirefu
Author: Philipp Koehn
Author: Sergio Ortiz-Rojas
Author: Leopoldo Pla
Author: Gema Ramírez-Sánchez
Author: Elsa Sarrías
Author: Marek Strelec
Author: Brian Thompson
Author: William Waites
Author: Dion Wiggins
Author: Jaume Zaragoza