Large scale acquisition and maintenance from the web without source access
Pages: 97-101
Leonard, Thomas
Glaser, Hugh
Handschuh, Siegfried
Dieng-Kuntz, Rose
Staab, Steffen
October 2001
Leonard, Thomas and Glaser, Hugh (2001) Large scale acquisition and maintenance from the web without source access. In Handschuh, Siegfried, Dieng-Kuntz, Rose and Staab, Steffen (eds.) Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001. pp. 97-101.
Record type: Conference or Workshop Item (Other)
Abstract
Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically. Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback. The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation. In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.
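The record-and-replay idea described in the abstract can be illustrated with a minimal sketch. It is not Dome's implementation: the step names (`select`, `field`), the `RECORDED_STEPS` list, and the functions `replay` and `replay_batch` are hypothetical, and the sketch assumes the pages are well-formed XML (for example XHTML) so that Python's standard `xml.etree.ElementTree` parser can read them. Real-world HTML would need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical recorded sequence: each step is (operation, argument).
# The operation names are illustrative only, not Dome's actual commands.
RECORDED_STEPS = [
    ("select", ".//table[@class='staff']/tr"),  # narrow to the repeated rows
    ("field", ("name", "td[1]")),               # first cell holds the name
    ("field", ("email", "td[2]")),              # second cell holds the address
]


def replay(steps, page_source):
    """Replay a recorded sequence against one page and return the records."""
    root = ET.fromstring(page_source)  # assumes well-formed XML/XHTML input
    nodes = [root]
    fields = []
    for operation, argument in steps:
        if operation == "select":
            # Replace the current selection with all matches below it.
            nodes = [match for node in nodes for match in node.findall(argument)]
        elif operation == "field":
            fields.append(argument)
        else:
            raise ValueError(f"unknown recorded operation: {operation!r}")
    records = []
    for node in nodes:
        record = {}
        for key, path in fields:
            # findtext returns None when the path does not match.
            record[key] = (node.findtext(path) or "").strip()
        records.append(record)
    return records


def replay_batch(steps, pages):
    """Apply the same recorded sequence to a collection of similar pages."""
    return {name: replay(steps, source) for name, source in pages.items()}
```

Because the steps refer to the page layout rather than to specific values, calling `replay` again on a changed copy of the same page refreshes the extracted records, and `replay_batch` applies the same sequence to every page that shares the layout.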
More information
Published date: October 2001
Additional Information: Organisation: ACM (SIGART), AAAI
Venue - Dates: Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001, 2001-10-01
Organisations: Web & Internet Science, Electronics & Computer Science, IT Innovation
Identifiers
Local EPrints ID: 256185
URI: http://eprints.soton.ac.uk/id/eprint/256185
PURE UUID: b0853436-f35b-45f5-8dce-58ff007b41cc
Catalogue record
Date deposited: 17 Dec 2001
Last modified: 14 Mar 2024 05:39
Contributors
Author: Thomas Leonard
Author: Hugh Glaser
Editor: Siegfried Handschuh
Editor: Rose Dieng-Kuntz
Editor: Steffen Staab