Quick-and-clean extraction of linked data entities from microblogs
In this paper, we address the problem of finding named entities in very large micropost datasets. We propose methods to generate a sample of representative microposts by discovering tweets that are likely to refer to new entities. Our approach significantly speeds up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, as well as similar and redundant tweets, while retaining information content.
We apply the approach to a corpus of 1.4 billion microposts, using the IE services of AlchemyAPI, Calais, and Zemanta to identify more than 700,000 unique entities. For the evaluation, we compare the runtime and the number of entities extracted from the full and the downscaled versions of a micropost set. We demonstrate that, for datasets of more than 10 million tweets, we can reduce the dataset size by more than 80% while maintaining up to 60% coverage of the unique entities cumulatively discovered by the three IE tools.
We publish the resulting Twitter metadata as Linked Data using SIOC and an extension of the NERD core ontology.
5-12
Feyisetan, Oluwaseyi
3c8e9481-2b07-41d2-9d1c-f73ae8e95457
Simperl, Elena
40261ae4-c58c-48e4-b78b-5187b10e4f67
Tinati, Ramine
580576ca-d80d-4101-ae34-478753abcebb
Luczak-roesch, Markus
6cfe587f-e02c-48e8-b2b8-543952ab50a7
Shadbolt, Nigel
5c5acdf4-ad42-49b6-81fe-e9db58c2caf7
4 September 2014
Feyisetan, Oluwaseyi, Simperl, Elena, Tinati, Ramine, Luczak-roesch, Markus and Shadbolt, Nigel (2014) Quick-and-clean extraction of linked data entities from microblogs. Proceedings of the 10th International Conference on Semantic Systems: SEM '14, Leipzig, Germany. 04 Sep 2014 - 05 Sep 2014. (doi:10.1145/2660517.2660527).
Record type: Conference or Workshop Item (Paper)
This record has no associated files available for download.
More information
Published date: 4 September 2014
Venue - Dates: Proceedings of the 10th International Conference on Semantic Systems: SEM '14, Leipzig, Germany, 2014-09-04 - 2014-09-05
Identifiers
Local EPrints ID: 455901
URI: http://eprints.soton.ac.uk/id/eprint/455901
PURE UUID: d195b947-e92d-4db8-a32e-bdc00ed674c4
Catalogue record
Date deposited: 07 Apr 2022 16:55
Last modified: 16 Mar 2024 16:58
Contributors
Author: Oluwaseyi Feyisetan
Author: Elena Simperl
Author: Ramine Tinati
Author: Markus Luczak-roesch
Author: Nigel Shadbolt