Large scale acquisition and maintenance from the web without source access

Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically. Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback. The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation. In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.
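The record-and-replay idea in the abstract can be illustrated with a minimal sketch. This is not Dome's implementation or its operation set: the step names (`select`, `extract`), the `replay` function and the sample pages are all hypothetical, and the pages are assumed to be well-formed XML, which real HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical recorded script: a list of (operation, argument) steps.
# These operation names are illustrative, not Dome's actual operations.
RECORDED_STEPS = [
    ("select", ".//table[@class='staff']/tr"),
    ("extract", {"name": "td[1]", "email": "td[2]"}),
]


def replay(steps, page_source):
    """Apply the recorded steps to one well-formed page and return the records."""
    root = ET.fromstring(page_source)
    selection = [root]
    records = []
    for operation, argument in steps:
        if operation == "select":
            # Narrow the current selection using an ElementTree path expression.
            selection = [hit for node in selection for hit in node.findall(argument)]
        elif operation == "extract":
            # Read one record per selected node; missing fields become "".
            for node in selection:
                records.append({
                    field: (node.findtext(path) or "").strip()
                    for field, path in argument.items()
                })
        else:
            raise ValueError(f"unknown recorded operation: {operation!r}")
    return records


# Two sample pages with the same layout, as if generated from one database.
PAGE_A = """<html><body><table class="staff">
<tr><td>Ada Lovelace</td><td>ada@example.org</td></tr>
<tr><td>Alan Turing</td><td>alan@example.org</td></tr>
</table></body></html>"""

PAGE_B = """<html><body><table class="staff">
<tr><td>Grace Hopper</td><td>grace@example.org</td></tr>
</table></body></html>"""

# Batch processing: the same recorded sequence is replayed over every page.
results = [replay(RECORDED_STEPS, page) for page in (PAGE_A, PAGE_B)]
```

Because the steps are stored as data rather than as code, the same sequence can be replayed when a source page changes or applied to other pages with the same layout.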

Leonard, Thomas and Glaser, Hugh (2001) Large scale acquisition and maintenance from the web without source access. Handschuh, Siegfried, Dieng-Kuntz, Rose and Staab, Steffen (eds.) Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001. pp. 97-101.

Record type: Conference or Workshop Item (Other)

More information

Published date: October 2001

Additional Information: Organisation: ACM (SIGART), AAAI

Venue - Dates: Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001, 2001-10-01

Organisations: Web & Internet Science, Electronics & Computer Science, IT Innovation

Identifiers

Local EPrints ID: 256185

URI: http://eprints.soton.ac.uk/id/eprint/256185

PURE UUID: b0853436-f35b-45f5-8dce-58ff007b41cc

Catalogue record

Date deposited: 17 Dec 2001

Last modified: 14 Mar 2024 05:39

Contributors

Author: Thomas Leonard

Author: Hugh Glaser

Editor: Siegfried Handschuh

Editor: Rose Dieng-Kuntz

Editor: Steffen Staab
