University of Southampton Institutional Repository

Large scale acquisition and maintenance from the web without source access

Record type: Conference or Workshop Item (Other)

Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically. Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback. The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation. In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.
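The record-and-replay idea in the abstract can be illustrated with a small sketch. This is not Dome's implementation: Dome records direct-manipulation edits, whereas the sketch below assumes each recorded operation reduces to "follow a path in the tree and keep the text found there". The names `Step`, `replay`, and `replay_batch`, the sample pages, and the paths are all hypothetical, and the pages are assumed to be well-formed XML/XHTML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Step:
    """One recorded operation (hypothetical): follow a path, store the text."""
    field: str  # name under which the extracted text is stored
    path: str   # ElementTree path, relative to the document root


def replay(steps, page_source):
    """Apply every recorded step to one well-formed page."""
    root = ET.fromstring(page_source)
    record = {}
    for step in steps:
        node = root.find(step.path)
        # A missing node yields None, so a layout change is visible in the output.
        record[step.field] = node.text.strip() if node is not None and node.text else None
    return record


def replay_batch(steps, pages):
    """Replay the same recorded sequence over a collection of pages."""
    return [replay(steps, page) for page in pages]


# Sequence "recorded" once against a sample page (assumed layout).
steps = [
    Step("title", "./body/h1"),
    Step("price", "./body/table/tr/td[2]"),
]

# Two pages from the same hypothetical site, sharing one regular layout.
page_a = ("<html><body><h1>Widget</h1>"
          "<table><tr><td>Price</td><td>4.50</td></tr></table></body></html>")
page_b = ("<html><body><h1>Gadget</h1>"
          "<table><tr><td>Price</td><td>7.25</td></tr></table></body></html>")

records = replay_batch(steps, [page_a, page_b])
```

Replaying `steps` against an updated copy of a page refreshes the extracted record, and passing many pages to `replay_batch` corresponds to the batch processing the abstract mentions. Real web pages are often not well-formed, so a practical version would need a tolerant HTML parser in place of `ET.fromstring`.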

Citation

Leonard, Thomas and Glaser, Hugh (2001) Large scale acquisition and maintenance from the web without source access. Handschuh, Siegfried, Dieng-Kuntz, Rose and Staab, Steffan (eds.) At Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001, pp. 97-101.

More information

Published date: October 2001
Additional Information: Organisation: ACM (SIGART), AAAI
Venue - Dates: Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001, 2001-10-01
Organisations: Web & Internet Science, Electronics & Computer Science, IT Innovation

Identifiers

Local EPrints ID: 256185
URI: http://eprints.soton.ac.uk/id/eprint/256185
PURE UUID: b0853436-f35b-45f5-8dce-58ff007b41cc

Catalogue record

Date deposited: 17 Dec 2001
Last modified: 18 Jul 2017 09:48

Contributors

Author: Thomas Leonard
Author: Hugh Glaser
Editor: Siegfried Handschuh
Editor: Rose Dieng-Kuntz
Editor: Steffan Staab
