The University of Southampton
University of Southampton Institutional Repository

Quick-and-clean extraction of linked data entities from microblogs

Feyisetan, Oluwaseyi
3c8e9481-2b07-41d2-9d1c-f73ae8e95457
Simperl, Elena
40261ae4-c58c-48e4-b78b-5187b10e4f67
Tinati, Ramine
580576ca-d80d-4101-ae34-478753abcebb
Luczak-roesch, Markus
6cfe587f-e02c-48e8-b2b8-543952ab50a7
Shadbolt, Nigel
5c5acdf4-ad42-49b6-81fe-e9db58c2caf7

Feyisetan, Oluwaseyi, Simperl, Elena, Tinati, Ramine, Luczak-roesch, Markus and Shadbolt, Nigel (2014) Quick-and-clean extraction of linked data entities from microblogs. Proceedings of the 10th International Conference on Semantic Systems: SEM '14, Leipzig, Germany. 04 Sep 2014 - 05 Sep 2014. pp. 5-12. (doi:10.1145/2660517.2660527).

Record type: Conference or Workshop Item (Paper)

Abstract

In this paper, we address the problem of finding named entities in very large micropost datasets. We propose methods to generate a sample of representative microposts by discovering tweets that are likely to refer to new entities.
Our approach significantly speeds up the semantic analysis process by discarding retweets, tweets without pre-identifiable entities, and similar or redundant tweets, while retaining information content.
We apply the approach to a corpus of 1.4 billion microposts, using the IE services of AlchemyAPI, Calais, and Zemanta to identify more than 700,000 unique entities. For the evaluation, we compare runtime and the number of entities extracted for the full and the downscaled versions of a micropost set. We demonstrate that, for datasets of more than 10 million tweets, we can reduce the size by more than 80% while maintaining up to 60% coverage of the unique entities cumulatively discovered by the three IE tools.
We publish the resulting Twitter metadata as Linked Data using SIOC and an extension of the NERD core ontology.
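
The three discarding steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the retweet pattern, the hashtag/mention/capitalised-word heuristic standing in for "pre-identifiable entities", the Jaccard similarity measure, and the 0.6 threshold are all assumptions chosen for illustration, and the pairwise comparison would not scale to a corpus of the size reported here.

```python
import re

# Assumption: a retweet is any post that starts with the token "RT".
RETWEET = re.compile(r"^\s*RT\b", re.IGNORECASE)
# Assumption: a hashtag, a mention, or a capitalised word that is not the
# first word of the post hints at a "pre-identifiable" entity.
CANDIDATE = re.compile(r"[#@]\w+|(?<=\s)[A-Z][a-z]+")
TOKEN = re.compile(r"\w+")


def tokens(text):
    """Lower-cased set of word tokens, used for similarity comparison."""
    return frozenset(t.lower() for t in TOKEN.findall(text))


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def downscale(tweets, threshold=0.6):
    """Keep only tweets that pass all three (hypothetical) filters."""
    kept, kept_tokens = [], []
    for text in tweets:
        if RETWEET.match(text):
            continue  # step 1: discard retweets
        if not CANDIDATE.search(text):
            continue  # step 2: discard tweets with no entity candidate
        toks = tokens(text)
        if any(jaccard(toks, seen) >= threshold for seen in kept_tokens):
            continue  # step 3: discard similar or redundant tweets
        kept.append(text)
        kept_tokens.append(toks)
    return kept


def reduction(original_size, sample_size):
    """Fraction of the dataset removed by downscaling."""
    return 1 - sample_size / original_size


sample_input = [
    "RT @bbc: Obama visits Berlin today",
    "Obama visits Berlin today",
    "Obama visits Berlin today!!",
    "so tired lol",
    "Watching the match in Leipzig with friends",
]
sample = downscale(sample_input)
print(sample)
print(reduction(len(sample_input), len(sample)))
```

On the five toy posts above, the sketch keeps two and discards three. Coverage in the sense of the abstract would then be measured by running the IE tools on both the full and the downscaled set and comparing the unique entities found, which this sketch does not attempt.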

This record has no associated files available for download.

More information

Published date: 4 September 2014
Venue - Dates: Proceedings of the 10th International Conference on Semantic Systems: SEM '14, Leipzig, Germany, 2014-09-04 - 2014-09-05

Identifiers

Local EPrints ID: 455901
URI: http://eprints.soton.ac.uk/id/eprint/455901
PURE UUID: d195b947-e92d-4db8-a32e-bdc00ed674c4
ORCID for Elena Simperl: orcid.org/0000-0003-1722-947X

Catalogue record

Date deposited: 07 Apr 2022 16:55
Last modified: 16 Mar 2024 16:58

Contributors

Author: Oluwaseyi Feyisetan
Author: Elena Simperl
Author: Ramine Tinati
Author: Markus Luczak-roesch
Author: Nigel Shadbolt
