Large scale acquisition and maintenance from the web without source access
Leonard, Thomas and Glaser, Hugh (2001) Large scale acquisition and maintenance from the web without source access. Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001, 97-101.
Description/Abstract
Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically. Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback. The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation. In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.
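The record-and-replay idea in the abstract can be illustrated with a minimal sketch. This is not Dome's implementation or interface: the `Recorder` class, the `remove_tag` and `rename` operations, and the sample pages are all hypothetical, and the sketch assumes the pages are well-formed XML so that Python's standard `xml.etree.ElementTree` can parse them. Edits made to one page are logged as they are applied, and the log is then replayed on a second page with the same layout.

```python
import xml.etree.ElementTree as ET


def remove_tag(root, tag):
    """Delete every element named `tag`, wherever it occurs below root."""
    # Snapshot both levels so removal does not disturb the iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)


def rename(root, path, new_tag):
    """Give every element matching the ElementTree path a new tag."""
    for element in root.findall(path):
        element.tag = new_tag


# Hypothetical operation set; a real tool would offer many more edits.
OPERATIONS = {"remove_tag": remove_tag, "rename": rename}


class Recorder:
    """Applies tree edits and logs them so they can be replayed later."""

    def __init__(self):
        self.steps = []  # list of (operation name, argument tuple)

    def apply(self, root, name, *args):
        OPERATIONS[name](root, *args)
        self.steps.append((name, args))

    def replay(self, root):
        for name, args in self.steps:
            OPERATIONS[name](root, *args)
        return root


# Two invented pages with the same layout, as if generated from one database.
page_a = ET.fromstring(
    "<html><body><div><h1>Alice</h1><p>Room 12</p></div>"
    "<script>x()</script></body></html>"
)
page_b = ET.fromstring(
    "<html><body><div><h1>Bob</h1><p>Room 7</p></div>"
    "<script>y()</script></body></html>"
)

# Edit the first page once, recording each step.
recorder = Recorder()
recorder.apply(page_a, "remove_tag", "script")
recorder.apply(page_a, "rename", ".//div", "person")
recorder.apply(page_a, "rename", ".//h1", "name")
recorder.apply(page_a, "rename", ".//p", "room")

# Replay the same steps on the second page.
recorder.replay(page_b)
extracted_b = ET.tostring(page_b, encoding="unicode")
print(extracted_b)
```

Because the steps refer to structure (tag names and paths) rather than to the text of a particular page, the same log can be looped over a whole collection of similarly formatted pages, or rerun on a page after its content changes.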
| Item Type: | Conference or Workshop Item (UNSPECIFIED) |
|---|---|
| Additional Information: | Organisation: ACM (SIGART), AAAI |
| Divisions: | Faculty of Physical and Applied Science > Electronics and Computer Science; Faculty of Physical and Applied Science > Electronics and Computer Science > Web & Internet Science |
| Item ID: | 256185 |
| Date Deposited: | 17 Dec 2001 |
| Last Modified: | 02 Mar 2012 11:57 |
| Contributors: | Leonard, Thomas (Author); Glaser, Hugh (Author); Handschuh, Siegfried (Editor); Dieng-Kuntz, Rose (Editor); Staab, Steffan (Editor) |
| Date: | October 2001 |
| Status: | Published |
| URI: | http://eprints.soton.ac.uk/id/eprint/256185 |