University of Southampton Institutional Repository

T-REx: A large scale alignment of natural language with knowledge base triples


Elsahar, Hady, Vougiouklis, Pavlos, Remaci, Arslen, Gravier, Christophe, Hare, Jonathon, Simperl, Elena and Laforest, Frederique (2019) T-REx: A large scale alignment of natural language with knowledge base triples. In LREC 2018 - 11th International Conference on Language Resources and Evaluation. pp. 3448-3452.

Record type: Conference or Workshop Item (Paper)

Abstract

Alignments between natural language and Knowledge Base (KB) triples are an essential prerequisite for training machine learning approaches employed in a variety of Natural Language Processing problems. These include Relation Extraction, KB Population, Question Answering and Natural Language Generation from KB triples. Available datasets that provide those alignments are plagued by significant shortcomings – they are of limited size, they exhibit a restricted predicate coverage, and/or they are of unreported quality. To alleviate these shortcomings, we present T-REx, a dataset of large scale alignments between Wikipedia abstracts and Wikidata triples. T-REx consists of 11 million triples aligned with 3.09 million Wikipedia abstracts (6.2 million sentences). T-REx is two orders of magnitude larger than the largest available alignments dataset and covers 2.5 times more predicates. Additionally, we stress the quality of this language resource thanks to an extensive crowdsourcing evaluation. T-REx is publicly available at: https://w3id.org/t-rex.

Text
632 - Author's Original

More information

Submitted date: 2 October 2017
Accepted/In Press date: 13 December 2017
Published date: 2019

Identifiers

Local EPrints ID: 417557
URI: https://eprints.soton.ac.uk/id/eprint/417557
PURE UUID: 0e5eae34-650d-4829-bb16-c37e0acac81b
ORCID for Jonathon Hare: orcid.org/0000-0003-2921-4283
ORCID for Elena Simperl: orcid.org/0000-0003-1722-947X

Catalogue record

Date deposited: 02 Feb 2018 17:31
Last modified: 20 Jul 2019 00:54

Contributors

Author: Hady Elsahar
Author: Pavlos Vougiouklis
Author: Arslen Remaci
Author: Christophe Gravier
Author: Jonathon Hare
Author: Elena Simperl
Author: Frederique Laforest
