Large scale acquisition and maintenance from the web without source access


Leonard, Thomas and Glaser, Hugh (2001) Large scale acquisition and maintenance from the web without source access. Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001, 97-101.

Description/Abstract

Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically. Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback. The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation. In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.
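The record-and-replay idea described in the abstract can be illustrated with a minimal sketch. This is not Dome's interface or implementation: the `EditRecorder` class, its `delete_tag` and `rename` methods, and the two sample pages are all hypothetical, and the sketch assumes the pages are well-formed XML so that Python's standard `xml.etree.ElementTree` can parse them. It shows edits being recorded once as a list of steps and then replayed over a batch of pages that share a layout.

```python
import xml.etree.ElementTree as ET


class EditRecorder:
    """Hypothetical recorder: each edit is stored as an (operation, target, argument) step."""

    def __init__(self):
        self.steps = []

    def delete_tag(self, tag):
        # Record "remove every element with this tag" (e.g. advertising blocks).
        self.steps.append(("delete", tag, None))

    def rename(self, path, new_tag):
        # Record "give every element matching this path a meaningful tag name".
        self.steps.append(("rename", path, new_tag))

    def replay(self, xml_text):
        """Apply every recorded step, in order, to a freshly parsed page."""
        root = ET.fromstring(xml_text)
        for op, target, arg in self.steps:
            if op == "delete":
                # ElementTree keeps no parent pointers, so visit each
                # element and remove its direct children with this tag.
                for parent in list(root.iter()):
                    for child in parent.findall(target):
                        parent.remove(child)
            elif op == "rename":
                for node in root.findall(target):
                    node.tag = arg
        return root


# Two made-up pages with the same layout, as if generated from one database.
PAGE_A = ("<html><body><div class='ad'>Buy now</div><table><tr>"
          "<td class='name'>Widget</td><td class='price'>9.99</td>"
          "</tr></table></body></html>")
PAGE_B = ("<html><body><div class='ad'>Sale</div><table><tr>"
          "<td class='name'>Gadget</td><td class='price'>4.50</td>"
          "</tr></table></body></html>")

# Record the edits once...
recorder = EditRecorder()
recorder.delete_tag("div")
recorder.rename(".//td[@class='name']", "name")
recorder.rename(".//td[@class='price']", "price")

# ...then replay the same sequence over the whole batch of pages.
results = []
for page in (PAGE_A, PAGE_B):
    tree = recorder.replay(page)
    results.append((tree.findtext(".//name"), tree.findtext(".//price")))

print(results)
```

Storing the steps as data rather than applying them immediately is what lets the same sequence be rerun when a source page changes or applied to other pages of the same shape. A real tool would also need to cope with pages whose structure has drifted, which this sketch does not attempt.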

Item Type: Conference or Workshop Item (UNSPECIFIED)
Additional Information: Organisation: ACM (SIGART), AAAI
Divisions: Faculty of Physical Sciences and Engineering > Electronics and Computer Science
Faculty of Physical Sciences and Engineering > Electronics and Computer Science > Web & Internet Science
Faculty of Physical Sciences and Engineering > Electronics and Computer Science > IT Innovation Centre
ePrint ID: 256185
Date Deposited: 17 Dec 2001
Last Modified: 27 Mar 2014 19:58
URI: http://eprints.soton.ac.uk/id/eprint/256185
