The University of Southampton
University of Southampton Institutional Repository

Large scale acquisition and maintenance from the web without source access


Leonard, Thomas and Glaser, Hugh (2001) Large scale acquisition and maintenance from the web without source access. Handschuh, Siegfried, Dieng-Kuntz, Rose and Staab, Steffan (eds.) At Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001. pp. 97-101.

Record type: Conference or Workshop Item (Other)

Abstract

Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically. Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback. The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation. In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.
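
The record-and-replay idea in the abstract can be illustrated with a minimal sketch. This is not Dome's implementation or its recording format: the step names, the `replay` and `replay_batch` helpers, and the sample pages are all hypothetical, and the sketch assumes the pages are well-formed XML so that Python's standard `xml.etree.ElementTree` can parse them.

```python
import xml.etree.ElementTree as ET

# Hypothetical recorded sequence. Each step is an operation name plus an
# ElementTree path; this format is an assumption, not Dome's own.
RECORDED_STEPS = [
    ("select_all", ".//tr[@class='staff']"),  # choose the rows to harvest
    ("extract_text", "td[1]", "name"),        # first cell of each row
    ("extract_text", "td[2]", "email"),       # second cell of each row
]


def replay(steps, root):
    """Apply a recorded sequence to one parsed page and return its records."""
    rows, records = [], []
    for step in steps:
        if step[0] == "select_all":
            rows = root.findall(step[1])
            records = [{} for _ in rows]
        elif step[0] == "extract_text":
            _, path, field = step
            for row, record in zip(rows, records):
                cell = row.find(path)
                # A missing cell is recorded as None rather than guessed.
                record[field] = (cell.text or "").strip() if cell is not None else None
        else:
            raise ValueError(f"unknown step: {step[0]}")
    return records


def replay_batch(steps, pages):
    """Replay the same sequence over many similarly structured pages."""
    return {name: replay(steps, ET.fromstring(markup)) for name, markup in pages.items()}


# Two made-up pages sharing one layout, as if generated from a database.
PAGES = {
    "page_a": (
        "<html><body><table>"
        "<tr class='nav'><td>Home</td><td>Help</td></tr>"
        "<tr class='staff'><td>Ada</td><td>ada@example.org</td></tr>"
        "</table></body></html>"
    ),
    "page_b": (
        "<html><body><table>"
        "<tr class='staff'><td>Bob</td><td>bob@example.org</td></tr>"
        "<tr class='staff'><td>Cy</td><td>cy@example.org</td></tr>"
        "</table></body></html>"
    ),
}

if __name__ == "__main__":
    for name, records in replay_batch(RECORDED_STEPS, PAGES).items():
        print(name, records)
```

Because the steps are stored as data rather than applied once by hand, rerunning `replay` on a refreshed copy of a page updates the extracted records, and `replay_batch` covers a whole collection of pages with the same layout.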

More information

Published date: October 2001
Additional Information: Organisation: ACM (SIGART), AAAI
Venue - Dates: Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001, 2001-10-01
Organisations: Web & Internet Science, Electronics & Computer Science, IT Innovation

Identifiers

Local EPrints ID: 256185
URI: http://eprints.soton.ac.uk/id/eprint/256185
PURE UUID: b0853436-f35b-45f5-8dce-58ff007b41cc

Catalogue record

Date deposited: 17 Dec 2001
Last modified: 18 Jul 2017 09:48

Contributors

Author: Thomas Leonard
Author: Hugh Glaser
Editor: Siegfried Handschuh
Editor: Rose Dieng-Kuntz
Editor: Steffan Staab
