Leonard, Thomas and Glaser, Hugh,
Large scale acquisition and maintenance from the web without source access
Handschuh, Siegfried, Dieng-Kuntz, Rose and Staab, Steffen (eds.)
At Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001.
Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically. Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback. The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation. In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.
Conference or Workshop Item
Organisation: ACM (SIGART), AAAI
Venue - Dates: Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001, 2001-10-01
Web & Internet Science, Electronics & Computer Science, IT Innovation
17 Dec 2001
17 Apr 2017 23:08