(c) University of Southampton.
Link creation in most existing hypertext / hypermedia products is a time consuming task. Microcosm is an open hypermedia system, in which various dynamic techniques have been used to attempt to ease the task of linking large bodies of information. This paper introduces the Microcosm model, and focuses on a technique for using computational power for dynamically creating links, known as retrieval links. The algorithm used for creating retrieval links is described and some preliminary experiments to assess the value of this facility are discussed.
Most hypermedia information systems depend upon specifically anchored links that have been manually created to allow the user to navigate through the information. However it is doubtful whether this method of linking information will scale up successfully to large information systems. Microcosm is an open hypermedia system that has been developed with the intention of dealing with large amounts of multimedia material [Fountain90], and permits the manual creation of dynamically anchored links which may be followed from any point in the system. It is also possible to request that the system computes dynamic links to offer to the user. Both of these facilities reduce the amount of manual effort that is involved in creating links, but at the possible cost of reducing the quality of information available in the destination of the link. The implementation and integration of these computed links, known as retrieval links, is the primary topic of this paper.
Microcosm consists of a number of autonomous processes which communicate with each other by a message passing system. No information about links is held in the document data files in the form of mark-up. All data files remain in the native format of the application that created them. Instead, all link information is held in link databases, which hold details of the source anchor (if there is one), the destination anchor and any other attributes such as the link description. This model has the advantage that it is possible for processes to examine the complete link database as a separate item, and also it is possible to make link anchors in documents that are held on read only media such as CD-ROM and video disk.
Figure 1: The Microcosm Model
Microcosm allows a number of different actions to be taken on any selected item of interest, so consequently use of the system involves more than simply clicking on buttons to follow links. In Microcosm the user selects the item of interest (e.g. a piece of text) and then chooses an action to take. We may see this as selecting an object then sending it a message. A button in Microcosm is simply a binding of a specific selection and a particular action. A particular feature of Microcosm is the ability to generalise source anchors. In most hypertext systems the source anchor for any link is fixed at a particular point in the text. In Microcosm it is possible for the author to specify three levels of generality of link sources.
1) The generic link. The user will be able to follow the link after selecting the given anchor at any point in any document. Generic links are of considerable benefit to the author in that a new document may be created and immediately have access to all the generic links that have been defined for the system.
2) The local link. The user will be able to follow the link after selecting the given anchor at any point in the current document.
3) The specific link. The user will be able to follow the link only after selecting the anchor at a specific location in the current document. Specific links may be made into buttons.
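The three levels of generality can be illustrated with a minimal sketch. The record layout, the field names and the case-insensitive comparison are assumptions made for this illustration; they are not the actual Microcosm link database format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    anchor: str                        # selected text that triggers the link
    destination: str                   # destination document
    kind: str                          # "generic", "local" or "specific"
    source_doc: Optional[str] = None   # needed for local and specific links
    offset: Optional[int] = None       # needed for specific links only


def available_links(links: List[Link], selection: str,
                    doc: str, offset: int) -> List[Link]:
    """Return the links that may be followed from the given selection."""
    matches = []
    for link in links:
        # Assumption: anchors are compared without regard to case.
        if link.anchor.lower() != selection.lower():
            continue
        if link.kind == "generic":
            # Valid at any point in any document.
            matches.append(link)
        elif link.kind == "local" and link.source_doc == doc:
            # Valid at any point in one document.
            matches.append(link)
        elif (link.kind == "specific" and link.source_doc == doc
              and link.offset == offset):
            # Valid only at one location in one document.
            matches.append(link)
    return matches
```

In this sketch a newly created document immediately sees every generic link, because a generic link carries no source document at all.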
Figure 2: Following Links
The basic Microcosm processes are viewers and filters.
The task of the viewer is to allow the user to peruse the document, to make selections and to choose actions. Typical actions are follow link, make link and complete link (where links may be to processes as well as to documents). The actions themselves are not effected by the viewer. The viewer is responsible for binding the information into a message, which is sent on to the filter chain where it will look for one or more processes that can satisfy this request. Any Windows application might be used as a viewer, with the proviso that it is possible to select objects, and at the least copy them to the clipboard. Microcosm is capable of taking actions upon objects on the clipboard. In cases (such as Word for Windows and Superbase) where the application has some level of programmability, it is sometimes possible to add a menu to the application so that the application is a viewer in its own right.
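The division of labour between viewer and filter chain might be sketched as follows. The message fields, the filter functions and the dispatch loop are all hypothetical and only illustrate the idea of binding a selection and an action into a message; they do not reproduce the Microcosm message format.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
Filter = Callable[[Message], List[str]]


def make_message(action: str, selection: str, document: str) -> Message:
    """What a viewer does: bind the selection and the chosen action together."""
    return {"action": action, "selection": selection, "document": document}


def link_database_filter(message: Message) -> List[str]:
    """Hypothetical filter that answers 'follow link' from a tiny link table."""
    table = {"microcosm": ["overview.txt"]}
    if message["action"] != "follow link":
        return []
    return table.get(message["selection"].lower(), [])


def compute_links_filter(message: Message) -> List[str]:
    """Hypothetical filter that answers 'compute link' requests."""
    if message["action"] != "compute link":
        return []
    return ["<ranked documents for: %s>" % message["selection"]]


def dispatch(message: Message, chain: List[Filter]) -> List[str]:
    """Pass the message along the chain and collect every filter's answer."""
    results: List[str] = []
    for handle in chain:
        results.extend(handle(message))
    return results
```

The viewer never needs to know which process, if any, will satisfy the request; it only builds the message and hands it to the chain.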
A major strength of Microcosm is its ability to integrate other applications. In fact Microcosm may be seen as an umbrella environment, allowing the user to make links from documents in one application package to documents in another application package. While many other hypermedia systems, for example Guide, allow the user to follow links from a hypermedia document to an external application, the facility to follow links from an external application into a hypermedia application, and therefore on to another external application is unusual. Microcosm provides a link service which allows other applications to appear to have hypermedia functionality.
Link Databases.
Link Databases hold all the information referring to links. More than
one database may be installed at a time. It is possible to have a concept of
public and private databases.
Show Links.
When a user selects a chunk of text and uses the Show Links
action this filter will organise the display of all available links,
including those links which are not buttons, such as generic links.
Compute Links
Sometimes no links have been defined for a particular subject. On these
occasions it is desirable to offer the user some further assistance. Microcosm
has a facility that allows a user to batch a set of text files and to index
these documents. Once this indexing has been done a block of text may be
selected and the action, compute link, may be chosen. The system will very
rapidly return a number of other documents within the system that have a
similar vocabulary to the selected block, in the order of best match. This
facility is analysed in greater detail in Section 3.
Other Navigational Filters
A well documented problem in hypermedia is that of remaining oriented
while navigating about a network of nodes (e.g. [Hammond88]). Microcosm
provides some tools to assist users with navigation. There is a history
mechanism that keeps a list of all documents that have been visited, and allows
the user to return to a chosen point; a Mimic filter, which allows the user to
follow a tour through a set of documents pre-defined by an author; and a
bookmark mechanism, which allows users to mark document windows that they wish to
keep on the screen. Further research is currently being conducted into
methods of integrating other navigational tools, such as maps, into an open
hypermedia system.
Some Terminology:
A static link is a link for which both the source and the destination of the link have been well defined before it is used. The attributes of the static links are physically stored in the hypertext system. They may either be embedded in the information units (documents) or be stored separately.
A dynamic link is a link for which the source and/or the destination of the link have not been defined until the link is followed. The source and/or the destination of a dynamic link is decided according to some rules at the time that the link is followed.
Computed links are created by the system according to some predefined rules. Computed links may be either static links or dynamic links, depending on when the link creation is completed.
From the point of view of link creation, there are manually created links and computed links. From the point of view of following links, there are static links and dynamic links. A static link may be created either manually or by computing, whereas a dynamic link cannot be created manually.
In the above terminology, most hypertext systems support manually created static links; the link attribute information is usually stored within the information unit or document in the form of hidden mark-up. However, by separating the link attributes from the documents we achieve certain advantages.
The ability to implement generic links. These are manually created, dynamic links, in that the source anchor is not fixed.
The system remains open in the sense that the documents remain unchanged by Microcosm, and may be viewed, edited and processed by other applications.
It is possible to batch process all the information in the link database.
Link databases may be passed between users.
Clearly, manually created static links (e.g. buttons) give the most specific access to information, since the author of the link has clearly defined the link between two specific pieces of information. This is why we call such links specific links within Microcosm. However, such links also require a considerable amount of manual effort to create a suitable network through the information units.
Where specific links are not available, we have generic links. These allow us to follow a link from any point at which a given item, such as a word or phrase, occurs, to a specific destination. Clearly the quality of such information may be lower, as the author can only create the destination, and users may find ways of following such links from inappropriate sources. However, the manual effort is decreased as the author need only create the destination of the link once, and then the link will be available from any document.
Where no specific or generic links are available, we have retrieval-links. These are computed links which are dynamically created. Such links require no intervention from the author, but the quality of the information found by following such links may be lower than manually created links.
The remainder of this section is devoted to explaining how retrieval-links have been implemented within Microcosm.
1. Automatic indexing.
Automatic indexing is a process that uses machine power to produce document identifiers, which are then matched against the identifiers obtained from queries. Since one of our purposes in designing retrieval-links is to use machine power to replace some manual work, the ability to index automatically is essential.
2. Fast retrieval speed.
Because hypertext systems work interactively, it is not feasible to keep users waiting for a long time in order to follow a retrieval-link. Generally, a "follow link" operation should give a response as soon as possible. Research has shown that "the difference between one system with a response of several seconds and another with sub-second response is so great as to make them seem qualitatively different." [Akscyn88]
3. Ease of use.
Users of a hypertext system may have various backgrounds. To avoid the necessity of learning extra skills before using a retrieval-link, a retrieval mechanism should be designed to be easy to use. In other words, the retrieval operations should be similar to the operations of following an ordinary link. This requires that the information retrieval queries should be expressed in natural language so that users can directly select part of a document as the source of a retrieval-link.
4. User controlled results.
The environments provided by hypertext / multimedia systems are suitable for browsing quite a large number of documents efficiently. This means that users are more involved in the retrieval process, and are able to assist in judging the usefulness of the retrieved documents. This user involvement also means that the retrieval process may have higher recall and a relatively lower precision.
In the Microcosm implementation, the retrieved documents are ranked before being presented to users. The ranking is based on the similarity between the text in the query and the textual content of each document. The number of retrieved documents is controlled by a set of rules to avoid too many documents being offered to the user.
Here, the most important things to be considered when designing an information retrieval mechanism for a hypertext system are the speed of retrieval and the ability to index automatically. Conklin [Conklin87] pointed out that the most distinguishing characteristic of a hypertext system is its machine-supported links, and that another essential characteristic is the speed with which the system responds to link referencing.
Single term weightings in an inverted file can be extracted from documents by automatic indexing processing, can be stored in a simple data structure and then can be accessed quickly for retrieval calculation.
The following steps are used to form such an inverted file.
(1) Separate the words from the documents.
(2) Eliminate stop words by consulting a stop words list.
(3) Reduce word variants to a single form, which is called the stem.
(4) Count the stem frequency for all remaining stems in each document and create weighted values for each stem in each document.
It is easy to see that all the words that remain after removing the stop words are included in the inverted file. This obviously can increase the recall of retrieval and, as discussed previously, this is what we need.
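The four steps can be sketched in a few lines. The stop word list and the `toy_stem` function are placeholders invented for this illustration (the system described here uses the Lovins stemmer), and the raw stem frequency stands in for the weighted value.

```python
import re
from collections import Counter, defaultdict
from typing import Dict

# Illustrative stop word list only; a real list would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def toy_stem(word: str) -> str:
    """Placeholder stemmer: strips a plural 's' and nothing else."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_inverted_file(documents: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Map each stem to {document id: frequency of the stem in that document}."""
    inverted: Dict[str, Dict[str, int]] = defaultdict(dict)
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z]+", text.lower())           # step 1: separate words
        words = [w for w in words if w not in STOP_WORDS]     # step 2: drop stop words
        stems = [toy_stem(w) for w in words]                  # step 3: reduce to stems
        for stem, freq in Counter(stems).items():             # step 4: count per document
            inverted[stem][doc_id] = freq
    return dict(inverted)
```

Every stem that survives step 2 ends up in the index, which is the property that favours recall.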
After studying the possible choices, we chose the stemming algorithm designed by Lovins [Lovins68].
This stemming algorithm is based on the longest match. To stem a word means to remove the ending from the word. In order to do so, an ending list is used. The endings of a word are compared with the ending list and, if more than one ending provides a match, the longest one is removed. According to Lovins, removing the longest ending is better than removing several short endings one by one.
To cope with spelling exceptions, two steps are used in stemming a word. First, the longest ending is removed from the word; then the remainder of the word is processed by another routine called the recoding procedure. This modifies some of the remaining part according to 34 predefined rules [Lovins68].
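The two-step idea can be sketched as below. The ending list is a tiny illustrative subset chosen for this example, not the list published by Lovins, and only a single recoding rule (undoubling a final consonant) is shown; the minimum stem length of two letters is likewise an assumption of the sketch.

```python
# Illustrative subset of endings only; the published list is far longer.
ENDINGS = ["ational", "ation", "ings", "ing", "ed", "es", "s"]
MIN_STEM = 2  # assumed minimum number of letters left after removal


def remove_longest_ending(word: str) -> str:
    """Step 1: remove the longest ending in the list that matches the word."""
    candidates = [e for e in ENDINGS
                  if word.endswith(e) and len(word) - len(e) >= MIN_STEM]
    if not candidates:
        return word
    longest = max(candidates, key=len)
    return word[:-len(longest)]


def recode(stem: str) -> str:
    """Step 2: a single example recoding rule, undoubling a final consonant."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in "bdglmnprst":
        return stem[:-1]
    return stem


def stem(word: str) -> str:
    return recode(remove_longest_ending(word.lower()))
```

For "linkings" both "ings" and "s" match, and the longer ending is removed in one pass, leaving "link"; for "running" the recoding step turns "runn" into "run".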
To save the storage space used by the inverted file, each weighted value is stored in just one byte. This means there can be a maximum of 256 different weighted values. In other words, any weight value is finally mapped to one of 256 values. We used a logarithmic function to achieve this mapping, rather than a linear function. This has the effect of giving much greater significance to the difference between a stem occurring once or twice than to the difference between a stem occurring 99 or 100 times. If a query stem occurs twice in a document, the document is probably twice as significant as a document in which the stem only occurs once, whereas there is little difference in significance between documents containing 99 and 100 occurrences, which will both be heavily weighted anyway.
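The exact logarithmic function is not given here, so the following is only one possible mapping, written under stated assumptions: frequencies are capped at a hypothetical `MAX_FREQ`, and the result is scaled so that the cap maps to 255.

```python
import math

MAX_FREQ = 1000  # assumed cap on stem frequency; not taken from the system


def byte_weight(freq: int, max_freq: int = MAX_FREQ) -> int:
    """Map a stem frequency to a one-byte weight (0..255) on a log scale."""
    if freq <= 0:
        return 0
    freq = min(freq, max_freq)
    # log(1 + freq) keeps a single occurrence above zero.
    return round(255 * math.log(1 + freq) / math.log(1 + max_freq))


# The step from 1 to 2 occurrences is larger than the step from 99 to 100.
low_step = byte_weight(2) - byte_weight(1)
high_step = byte_weight(100) - byte_weight(99)
```

A linear mapping would give both steps the same size, so much of the byte's range would be spent distinguishing documents that are already heavily weighted.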
Suppose there are k stems in a query. If the inverse document frequency method [Jones72] were used, the weight for stem i would be: wi = log(N/Ni), where N is the total number of documents in the collection and Ni is the number of documents that share stem i. This weighting method, by favouring stems shared by fewer documents, emphasises the function of stems as devices to distinguish documents within the collection.
In the retrieval procedure designed for retrieval-links, the weight of query stem i was defined as wi = N/Ni, instead of wi = log(N/Ni). This change gave more emphasis to stems shared by fewer documents and simplified the similarity calculation. Since documents are indexed by all the stems appearing in each document, it is important to stress the effect of the important stems.
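A small worked comparison of the two weightings is given below. The function name, the dictionary of document frequencies and the decision to skip stems absent from the index are assumptions of this sketch.

```python
import math
from typing import Dict, List


def query_weights(query_stems: List[str], doc_freq: Dict[str, int],
                  n_docs: int, use_log: bool = False) -> Dict[str, float]:
    """Weight each distinct query stem by N/Ni, or by log(N/Ni) if use_log."""
    weights: Dict[str, float] = {}
    for stem in set(query_stems):
        n_i = doc_freq.get(stem, 0)
        if n_i == 0:
            continue  # assumption: a stem absent from the index contributes nothing
        ratio = n_docs / n_i
        weights[stem] = math.log(ratio) if use_log else ratio
    return weights


# Hypothetical collection of 100 documents: "link" occurs in 50, "microcosm" in 2.
doc_freq = {"link": 50, "microcosm": 2}
linear = query_weights(["link", "microcosm"], doc_freq, 100)
logged = query_weights(["link", "microcosm"], doc_freq, 100, use_log=True)
```

With N/Ni the rare stem outweighs the common one by a factor of 25, whereas with the logarithm the factor is under 6, which shows how the change stresses stems shared by fewer documents.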
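The similarity between a query and a document is then the weighted sum S(q,dj) = w1 × stem1j + w2 × stem2j + ... + wk × stemkj, where q stands for a given query that contains k stems, dj stands for document j, wi is the weight for query stem i, and stemij is the weighted value of stem i in document j.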
The similarity between the query and all documents needs to be computed in order to rank the documents according to their similarity to the query. Since in the inverted file all stemij are stored contiguously for a given stem i, the process of calculating all S(q,dj) using the above formula is very fast.
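The following sketch shows why a stem-major layout makes this calculation cheap. It assumes that S(q,dj) is the sum over query stems of wi × stemij with wi = N/Ni, and that the inverted file holds one contiguous row of per-document weighted values for each stem; both the layout and the function name are assumptions of the illustration.

```python
from typing import Dict, List, Tuple


def rank_documents(query_stems: List[str],
                   inverted: Dict[str, List[int]],
                   n_docs: int) -> Tuple[List[int], List[float]]:
    """Return document indices in decreasing order of similarity, plus the scores.

    inverted[stem] is a row of n_docs weighted values, one per document.
    """
    scores = [0.0] * n_docs
    for stem in set(query_stems):          # assumption: each distinct stem counts once
        row = inverted.get(stem)
        if row is None:
            continue                       # stem not in the index
        n_i = sum(1 for value in row if value > 0)
        if n_i == 0:
            continue
        w_i = n_docs / n_i                 # query stem weight N/Ni
        for j, value in enumerate(row):    # one sequential pass over the row
            scores[j] += w_i * value
    order = sorted(range(n_docs), key=lambda j: scores[j], reverse=True)
    return order, scores
```

Each query stem costs one sequential pass over its row, so the work grows with the number of query stems rather than with the size of the documents.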
(1) Calculate the average value of the similarities of all documents. Use the average value as a sub-threshold; only the documents whose similarity values are above the average value are passed to the next step.
(2) To avoid too many documents becoming destinations of a retrieval link, another sub-threshold, whose value is defined as M, is used to control the maximum number of destinations. Considering that a collection containing a large number of documents may have more documents relevant to a query than a small collection, a further sub-threshold, defined as a percentage P of the collection size, is also used to control the maximum number of destinations of a retrieval-link.
(3) Rank the documents according to their similarity values in decreasing order. Then at most max(M,P) of the documents that have above-average similarity values are chosen as the destinations of the retrieval-link, where P is taken as that percentage of the number of documents in the collection.
In the above procedure, step one controls the relative quality of the retrieval results, while step two is mainly used to avoid the relevant documents being buried among too many other documents.
At present M=5 and P=10%. For a collection that has 35 documents, a maximum of 5 destinations would be permitted. For a collection that has 100 documents, a maximum of 10 documents could become destinations.
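The selection rules can be sketched as follows, using the stated values M=5 and P=10%. Rounding P percent of the collection size down to a whole number is an assumption of this sketch, as are the function and parameter names.

```python
from typing import Dict, List


def select_destinations(scores: Dict[str, float],
                        m: int = 5, p_percent: int = 10) -> List[str]:
    """Choose retrieval-link destinations from per-document similarity scores."""
    n = len(scores)
    if n == 0:
        return []
    # Step 1: keep only the documents scoring above the average similarity.
    average = sum(scores.values()) / n
    above = [doc for doc, s in scores.items() if s > average]
    # Step 2: the limit is the larger of M and P percent of the collection size.
    limit = max(m, n * p_percent // 100)   # assumption: round down
    # Step 3: rank in decreasing order of similarity and cut at the limit.
    above.sort(key=lambda doc: scores[doc], reverse=True)
    return above[:limit]
```

With 35 documents the limit is max(5, 3) = 5, and with 100 documents it is max(5, 10) = 10, matching the figures quoted above. Fewer destinations are returned when fewer documents score above the average.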
Figure 3: Retrieval Links
To examine the performance of retrieval links, 20 departmental CSTRs (Computer Science Technical Reports) were used as a collection of documents.
As designed, the destinations of retrieval links are a group of documents. We expect that the destinations of retrieval-links are relevant to the source of the retrieval-links.
In one experiment, the titles of the 20 CSTRs were used as sources of the retrieval-links. In this case, we would expect that the destinations of retrieval-links would contain the document (the expected-document) whose title was used as the source of the retrieval-link. Since the destinations of retrieval-links are ranked according to their similarities to the source, we also expect that the expected-document will rank above all the other documents. The table below gives the experimental results.
Ranking of expected document   Number of documents that satisfied the ranking   Percentage
1                              17                                               85%
2                              2                                                10%
3                              0                                                0%
4                              1                                                5%
Figure 4: Experimental results: using document titles as the sources of retrieval-links.
In the above table, we can see that in most cases (85%), the expected-documents were ranked top of the destinations of retrieval links. Notice that there were no special weights for titles, authors' names, and abstracts of documents in the inverted file, so a title received no more weight than a sentence in the document. In another experiment with 100 RFC network protocol documents, 92% of the expected documents were ranked first. In a further experiment, some selected keywords were used as the source of retrieval-links and the results were also satisfactory. The decision to choose all stems from documents as keywords was made based on a similar consideration. Such results show that the indexing procedures and similarity function calculations used by the retrieval mechanism produce satisfactory results, and further work will be conducted to verify these results using the CACM collection.
The response time of retrieval-links was tested for a document collection containing 176 documents. Experiments showed that when several words were used as the source of a link, there was no significant delay in following a retrieval-link. For a query of about one hundred words, it takes a couple of seconds to produce the destinations of the retrieval-link.
The retrieval mechanism is based on the inverted file, which contains extracted information about each document. Since the retrieval process needs to access the weighted values of a stem across all documents, while the inverted file creation process can only process the documents one by one, a data structure for the inverted file must be chosen that either ensures a quick retrieval process or makes the creation of the inverted file simple. To ensure fast retrieval response, the data structure of the inverted file is optimised into a form that can be quickly accessed by the retrieval process. This makes inverted file creation hard work. By carefully designing the creation process, we can now create inverted files in reasonable time. For instance, indexing 176 documents (total size over 2MB) takes about 15 minutes.
There are a number of possible further improvements that we are currently investigating.
It would be possible to give authors a tool that uses retrieval links to suggest links that might be followed, and then lets them choose which links would be made into static links by including them in the link database.
One of the link attributes stored in the link database is a short textual description of the data that will be found by following the link. We are investigating the possibility of using this information either for enhancing the accuracy of the current computed links, or for offering some level of computed links in an information set that has not been pre-indexed.
Retrieval links currently offer destinations that are whole documents. We are investigating the possibility of enhancing the system to automatically offer destinations within the document when such a link is followed.
We are investigating the use of a synonym filter, which would increase the probability that a search for a given word in an index would be successful.
We are attempting to extend the index to include certain word pairs (phrases). It seems likely that where a phrase occurs in a query and a document there is a higher probability that this is a useful match than in the case where the words appear separately in the document. The current system is unable to distinguish between these two cases.
It is our belief that a computed links facility, enhanced as described above, would make an extremely accurate and useful facility for both authors and users, and will enable hypermedia systems to make the transition into dealing with large information systems.
[Conklin87] J. Conklin. Hypertext: an introduction and survey. IEEE Computer, 20(9):17-41, 1987.
[Davis92] Hugh Davis, Wendy Hall, Gerard Hutchings, David Rush and Rob Wilkins. Hypermedia and the Teaching of Computer Science: Evaluating an Open System. in David Bateman and Tim Hopkins (eds). The Proceedings of the Conference on Developments in the Teaching of Computer Science. The University of Kent. 1992.
[Fountain90] Andrew M. Fountain, Wendy Hall, Ian Heath and Hugh C. Davis. MICROCOSM: An Open Model for Hypermedia With Dynamic Linking. In A. Rizk, N. Streitz and J. Andre (eds). Hypertext: Concepts, Systems and Applications. The Proceedings of The European Conference on Hypertext, INRIA, France, November 1990. Cambridge University Press 1990.
[Hammond88] Hammond, N.V and Allinson, L.J. Travels around a learning support environment: rambling, orienteering or touring? In Soloway, E, Frye, D and Sheppard, S.B. (eds), CHI '88 Conference Proceedings: Human Factors in Computer Systems, ACM Press: New York, 269-273. 1988
[Jones72] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21, 1972
[Lovins68] Julie Beth Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22-31, 1968.
Zhuoxun Li, Hugh Davis and Wendy Hall
Department of Electronics and Computer Science
University of Southampton
Southampton SO9 5NH
e-mail mcm@ecs.soton.ac.uk