Applying Open Hypertext Principles to the WWW

Gary HILL, Wendy HALL, Dave DE ROURE, Les CARR

Multimedia Research Group,
Department of Electronics and Computer Science,
University of Southampton,
Highfield,
Southampton,
Hants,
SO17 1BJ
UK

ABSTRACT

The concept of an open hypertext environment is now accepted as the best way to provide a flexible and extensible approach to the provision of hypertext functionality. Although very popular, the World Wide Web system created at CERN in Switzerland is a closed system, and suffers a number of drawbacks as a result. This paper describes how an open hypertext approach may benefit users and authors of World Wide Web material, and illustrates how the World Wide Web may be extended to provide a more open hypertext model.

1. INTRODUCTION

Current research into hypermedia systems has embraced the need to move from closed systems to open environments which provide a more flexible approach. Such systems are characterised by the separation of link structures from the information being linked. This approach makes it possible to address the need for reduced authoring effort in large-scale systems, and to make maintenance of the hypermedia information easier. The Microcosm system developed at the University of Southampton is an example of such an environment.

A parallel development in recent years has been the widespread growth of the World Wide Web (WWW) [Berners-Lee 92] into probably the best-known hypertext system to date. While the WWW can be considered to be an open system, in that its protocols and formats are publicly documented and available, in its current form it provides a closed hypertext system. In particular, its use of embedded link information can cause problems for the authors and maintainers of WWW documents.

However, these problems with the WWW are not fundamental to its operation, and are simply indicative of hypertext practice at the time it was conceived. The rapid growth has not allowed time for current implementations to reflect recent developments in the hypertext area. It is possible to augment the facilities of the WWW and provide open link service facilities to WWW users and authors. This paper describes our first experiences of applying the approach taken by Microcosm to a WWW environment.

2. COMPARISON OF THE TWO APPROACHES

The Microcosm open hypertext system is a result of research into open hypertext systems that has been undertaken at the University of Southampton since 1990, and its architecture has been widely documented [Fountain 90, Davis 92, Hill 93, Davis 94]. The Microcosm model separates the management of documents from the management of linking, with both of these tasks implemented in a modular and extensible way. Presentation of information is handled by a set of document viewers which may be dynamically configured and which can include third-party applications in addition to Microcosm-specific applications. Similarly, the linking of documents is implemented by a configurable set of filter processes. Each process provides a small part of the overall linking functionality, and the filters present may be varied to suit a particular requirement, or new filters may be created to provide alternative approaches to linking. The modular architecture allows users to easily extend the system to suit their needs, and incorporate their existing resources.

Although designed with large-scale hypertext in mind, the initial implementation of Microcosm was based on a personal workstation, restricting its use to single-user or LAN-based workgroups. We are currently developing the model to operate in a fully distributed environment [Hill 94, De Roure 94].

The WWW on the other hand was designed with distributed access facilities from the beginning. This is provided in a very simple manner by the use of a node addressing scheme which allows remote systems to be specified. The hypertext model implemented by the current generation of WWW tools however has a simple point-to-point linking model based upon embedded links. This approach has several disadvantages which affect the scalability and maintenance effort required for a large distributed corpus of information.

· Link fossilisation and decay. Because link information is incorporated into the actual documents, management of changing information becomes a significant overhead. As documents are moved, edited, or deleted, any documents which refer to them must also be altered to reflect the change. As the context of a document becomes wider, this problem increases, as there is no way to determine which documents refer to it, and therefore no way to issue notification of changes.

· Dead-ends. Another problem is that of links to dead-ends. This occurs because links are embedded in documents, and can therefore only be applied to the WWW's native document format, HTML (HyperText Mark-up Language). Links cannot easily be followed from other document types, even when they are viewed by a WWW client.

· Author-led navigation. Because of the nature of the hypertext model of the WWW, i.e. embedded point-to-point links, exploration of the available information can only be author-led. That is, users can only follow links to documents which the author of the current document is aware of and considers to be relevant.

The separation of links from documents can aid in the management of large hypertext structures. For example, in Microcosm, details of all documents are recorded in a document database, and each is assigned a unique identifier. All linking is then carried out in terms of these identifiers. Thus, if a document is moved or renamed, details of this change are recorded once in the document database, and all links remain valid. The separate management of link information can also allow links to be made between documents in proprietary formats. Links made in this way can be retrieved by the presenting application and made available without affecting the source format of the document, and therefore ensuring that it can still be manipulated in its own right. Another advantage of separate link information is the ability to provide a selection of possible link structures which may be applied to the same source documents.
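
The identifier-based scheme described above can be illustrated with a short Python sketch. It is not Microcosm code: the class and method names, the identifier format and the file paths are all assumptions made for the example. It shows only that a move is recorded once in the document database while the link records are left untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source_id: str      # identifier of the document the link starts in
    dest_id: str        # identifier of the document the link leads to
    description: str


class DocumentDatabase:
    """Maps stable document identifiers to current locations (hypothetical)."""

    def __init__(self):
        self._locations = {}

    def register(self, doc_id, location):
        self._locations[doc_id] = location

    def move(self, doc_id, new_location):
        # A move or rename is recorded once, here; no link is touched.
        if doc_id not in self._locations:
            raise KeyError(doc_id)
        self._locations[doc_id] = new_location

    def resolve(self, doc_id):
        return self._locations[doc_id]


docs = DocumentDatabase()
docs.register("doc-001", "/papers/intro.txt")
docs.register("doc-002", "/papers/filters.txt")

# Links refer to identifiers only, never to file locations.
linkbase = [Link("doc-001", "doc-002", "Filter architecture")]

docs.move("doc-002", "/archive/1993/filters.txt")
resolved = [docs.resolve(link.dest_id) for link in linkbase]
```

After the move, the unchanged link still resolves, now to the new location.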

Microcosm in particular can offer additional advantages over the WWW when authoring material. For example, 'generic' links may be made, which apply to a particular text string wherever it occurs, rather than just where the link was originally authored. This can greatly ease the authoring of common links, such as those on names or keywords, which would normally need to be created for each occurrence. Similarly, the text retrieval facilities available in Microcosm can help identify potential links, without the need to search documents manually. As well as reducing authoring effort, these facilities allow reader-led navigation of the hypertext to take place. Rather than solely follow a trail of clues provided by the author, the reader is able to 'query' the hypertext to find links. In addition, links will apply to documents which the author has never seen if the content is relevant.
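
The difference between specific and generic links can be sketched as follows. This is a minimal Python illustration under our own assumptions (the record layout, the function name follow_link and the case-insensitive matching are all hypothetical), not a description of how Microcosm stores or resolves links.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Link:
    selection: str                    # text the link is anchored on
    dest_id: str                      # identifier of the destination document
    source_id: Optional[str] = None   # None marks a generic link


def follow_link(linkbase, doc_id, selection):
    """Return destinations for a reader's selection in document doc_id."""
    wanted = selection.strip().lower()
    matches = []
    for link in linkbase:
        if link.selection.lower() != wanted:
            continue
        # A generic link applies in every document; a specific link
        # applies only in the document where it was authored.
        if link.source_id is None or link.source_id == doc_id:
            matches.append(link.dest_id)
    return matches


linkbase = [
    Link("Microcosm", "doc-overview"),                       # generic
    Link("filter", "doc-filters", source_id="doc-design"),   # specific
]
```

Because the generic link carries no source document, a call such as follow_link(linkbase, "doc-never-seen", "microcosm") finds it in a document the author has never seen, whereas the specific link on "filter" is offered only in "doc-design".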

Clearly, the type of facilities provided by Microcosm can offer advantages to WWW authors and users. Similarly the wide availability of the WWW and its distributed functionality can offer an easy method of making material authored in Microcosm more widely available. The following sections describe the ways in which Microcosm and the WWW may be integrated, and outline the results of the initial work that has been carried out.

3. INTEGRATION OF MICROCOSM WITH THE WWW

Microcosm and the WWW can be integrated at a variety of levels. This section describes the range of approaches that are possible.

3.1. Microcosm Aware WWW Clients

The simplest way of combining Microcosm and WWW is to utilise WWW clients as Microcosm viewers. By creating a WWW client that is also Microcosm-aware, additional, open hypertext facilities may be overlaid on WWW material. WWW links may be followed as usual, but links can also be made to and from other Microcosm document types, e.g. spreadsheets, word processor documents etc.

This approach has been used to incorporate common WWW clients such as NCSA Mosaic and Netscape into Microcosm environments. This has been carried out initially using the Microcosm universal viewer [Davis 94], and we are also investigating the development of a fully Microcosm-aware client.

3.2. Delivering Microcosm material using WWW

Another approach is to use Microcosm to author material, with all the advantages that its flexible linking strategies can bring, and then use the WWW as a simple way of making this information available. All the necessary information to create WWW documents can be extracted from the Microcosm document manager and link databases.

This approach allows material to be easily created and maintained, but also allows it to be disseminated widely and easily. Further discussion of our experiences with this approach is presented in the next section.

3.3. Implementing an open model for the WWW

A third approach is to provide the facilities of Microcosm (separate linkbases, reader-led link following, etc.) wholly within a WWW environment. The ability of the WWW to execute specified scripts allows link databases to be implemented and accessed via a standard WWW server. In the same way that, in Microcosm, messages are routed through the current set of link processes, the WWW version sends HTTP requests, via a CGI script, to a set of processes running on the server. These processes attempt to satisfy link requests, for example by searching link databases, or by matching against external resources such as dictionaries and manual pages. The CGI script returns the results obtained from these processes as a reply to the original HTTP request, in the form of an HTML document.
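
The server-side arrangement described above might be sketched in Python as follows. This is an illustration only: the two filter functions, their toy data and the name handle_request are assumptions, the filters run in-process rather than as separate server processes, and a real CGI script would also emit a Content-Type header before the document.

```python
from html import escape


def linkbase_filter(selection):
    """Look the selection up in a small in-memory link database."""
    linkbase = {"microcosm": [("Microcosm overview", "/docs/overview.html")]}
    return linkbase.get(selection.lower(), [])


def dictionary_filter(selection):
    """Match the selection against an external resource (a toy dictionary)."""
    dictionary = {"hypertext": "/dictionary/hypertext.html"}
    url = dictionary.get(selection.lower())
    return [("Dictionary entry for " + selection, url)] if url else []


# The configurable set of link processes; the order is arbitrary here.
FILTERS = [linkbase_filter, dictionary_filter]


def handle_request(selection, filters=FILTERS):
    """Pass the request to every filter and return an HTML reply."""
    links = []
    for link_filter in filters:
        links.extend(link_filter(selection))
    if links:
        items = "\n".join(
            '<li><a href="{}">{}</a></li>'.format(
                escape(url, quote=True), escape(title))
            for title, url in links
        )
        body = "<ul>\n" + items + "\n</ul>"
    else:
        body = "<p>No links found.</p>"
    heading = "<h1>Links for " + escape(selection) + "</h1>"
    return "<html><body>\n" + heading + "\n" + body + "\n</body></html>"
```

Adding or removing an entry in FILTERS changes the linking behaviour without altering the request handling, which mirrors the configurable filter set of Microcosm.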

This architecture must be matched by a Microcosm-style interface for the client, which allows the user to make arbitrary selections in a document, then choose an action to carry out on that selection from a menu (e.g. Follow Link). This interface has been provided as an additional utility which can be used in conjunction with various popular WWW clients. The utility creates an HTTP request based on the selection and action, and causes the client to send this request to a predetermined WWW server. The result of the request is then displayed by the client. This document will typically list a number of possible links from the chosen selection as a set of HTML buttons.
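
One plausible form for the request constructed by the client utility is a URL with the selection and action encoded as query fields. The Python sketch below assumes this encoding; the server address, the script path and the field names are hypothetical and are not taken from the utility itself.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical address of the link service script.
LINK_SERVICE = "http://www.example.org/cgi-bin/linkservice"


def build_request(selection, action, document_url, service=LINK_SERVICE):
    """Encode the user's selection and chosen action as a GET request URL."""
    query = urlencode({
        "action": action,
        "selection": selection,
        "document": document_url,
    })
    return service + "?" + query


url = build_request("open hypertext", "follow_link",
                    "http://www.example.org/paper.html")

# The server side can recover the fields by decoding the query string.
fields = parse_qs(urlsplit(url).query)
```

Encoding the fields with urlencode ensures that spaces and punctuation in an arbitrary selection survive the journey to the server.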

4. USING MICROCOSM TO AUTHOR WWW DOCUMENTS

As part of the investigation of how Microcosm may provide improved authoring of WWW documents, a simple tool to convert Microcosm material into HTML has been developed. This section describes our experiences with this tool, and the issues which its use has highlighted.

4.1. mcm2html: Microcosm to HTML conversion

The mcm2html tool can be used to translate hypertext structures authored using Microcosm into HTML documents. This tool accepts a list of text documents and a link database as its main arguments, and for each document interrogates the database to find applicable links. For specific links (those authored so as to apply only in a certain document), this is a simple database query. For generic links, which are bound only to a particular word or phrase, locating valid links is harder. This can be done by querying the database for each word in a document in turn, but this is a very time-consuming process. The current version of mcm2html builds a hash-table of the available generic links to speed the process up significantly. Words in the document can be checked against the hash-table much more quickly than against the whole database.
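
The hash-table technique can be sketched in Python as follows. This is not the mcm2html source: the function names, the tokenising rule and the choice to index phrases by their first word are assumptions made so that multi-word generic links can be handled with a single lookup per document word.

```python
import re
from collections import defaultdict


def build_index(generic_links):
    """Index generic links by the first word of their anchor phrase."""
    index = defaultdict(list)
    for phrase, dest in generic_links:
        words = phrase.lower().split()   # phrases are assumed non-empty
        index[words[0]].append((words, dest))
    return index


def find_generic_links(text, index):
    """Return (word_position, phrase, destination) for each match in text."""
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9'-]+", text)]
    found = []
    for pos, word in enumerate(words):
        # One dictionary lookup per word instead of one database query.
        for phrase_words, dest in index.get(word, ()):
            if words[pos:pos + len(phrase_words)] == phrase_words:
                found.append((pos, " ".join(phrase_words), dest))
    return found


generic_links = [("Microcosm", "overview.txt"),
                 ("link service", "services.txt")]
index = build_index(generic_links)
text = "Microcosm provides an open link service. The link database is separate."
matches = find_generic_links(text, index)
```

In this example the second occurrence of "link" is looked up but rejected, since the word that follows it does not complete the phrase "link service".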

Once the applicable links are established, the tool creates an HTML file and streams the text document to it, interleaved with HTML structuring information, and the links that have been found. The links are formed by combining the destination document name with a URL 'stem' supplied to the tool as a command line argument. This identifies the server which will provide access to the resultant documents.
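
The formation of links from a stem and a destination name can be illustrated with a small Python sketch. The function names, the use of character offsets to locate link sources, and the example stem are assumptions; the sketch also assumes that link spans do not overlap.

```python
from html import escape
from urllib.parse import quote


def make_url(stem, document_name):
    """Combine the server 'stem' with a destination document name."""
    return stem.rstrip("/") + "/" + quote(document_name)


def link_text(text, links, stem):
    """Wrap each (start, end, destination) span of text in an HTML anchor.

    Spans are character offsets and are assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for start, end, dest in sorted(links):
        pieces.append(escape(text[cursor:start]))
        pieces.append('<a href="{}">{}</a>'.format(
            escape(make_url(stem, dest), quote=True),
            escape(text[start:end])))
        cursor = end
    pieces.append(escape(text[cursor:]))
    return "".join(pieces)


text = "Microcosm separates links & documents."
links = [(0, 9, "overview.html")]   # character offsets of "Microcosm"
html_line = link_text(text, links, "http://www.example.org/mcm/")
```

Because the stem is supplied separately, the same source material could be compiled for a different server simply by changing that one argument.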

Some additional provisions must be made to ensure that all HTML requirements are met. For example, HTML does not permit links to multiple destinations, but Microcosm does, and this must be taken into account when creating the HTML version of Microcosm material. This is currently done by creating an intermediate page with the available links listed. Alternatively the available links could be listed in the source document. In addition, the tool must attempt to identify paragraphs in the original text, so that WWW clients can parse and format the resultant document correctly.
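
Both provisions can be sketched briefly in Python. The intermediate page naming scheme, the function names and the blank-line rule for detecting paragraphs are assumptions for illustration; the heuristics actually used by the tool are not specified here.

```python
import re
from html import escape


def link_target(link_id, destinations):
    """Return (href, extra_page); extra_page is None for a single destination."""
    if len(destinations) == 1:
        return destinations[0][1], None
    # HTML anchors take one destination, so point at an intermediate page
    # listing every choice (the file naming scheme is hypothetical).
    page_name = "choices-{}.html".format(link_id)
    items = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(
            escape(url, quote=True), escape(title))
        for title, url in destinations
    )
    page = "<html><body>\n<ul>\n" + items + "\n</ul>\n</body></html>"
    return page_name, page


def mark_paragraphs(text):
    """Treat runs of blank lines as paragraph breaks (a heuristic)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "\n".join("<p>" + escape(p) + "</p>" for p in paragraphs)
```

A link with one destination is emitted directly, while a link with several destinations is pointed at a generated page from which the reader chooses.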

4.2. Features

As well as links between the specified text documents, links may of course be made to any other document type, such as graphic formats etc. In particular, Microcosm supports links to existing WWW documents, allowing material created in this way to form a full part of existing structures. There can be a problem if the document formats used are not understood by a particular WWW client, but, in most cases, this can be handled by the use of suitable 'helper' applications.

The main advantage of this approach is the reduction in maintenance effort, and the ease of re-use of material. If the details of link structures need to be updated, or altered, this can easily be done by adding new links to the Microcosm material and 'recompiling' the HTML documents. Similarly, if alternative link structures are required, these may be maintained in various linkbases. Then, if the contents of documents are to be changed, this can be performed once on the original documents, and the HTML versions recreated. In a purely HTML-based situation, these changes would have to be made to each alternative version, greatly increasing the effort involved and the chance of mistakes being made.

Another benefit is improved support for group authoring of hypertext material. Microcosm is able to support multiple access to a document set when working on a network, and offers facilities that make navigation of the available material much easier. This is a great benefit when material is being created by a number of authors. In addition, by utilising a pre-agreed approach to authoring, the links created may automatically highlight relationships between documents written by different authors. For example, authors can 'keyword' their documents by making appropriate generic links to them. Thus, documents created by different authors automatically become linked when keywords appear in the text. This implicit incorporation of cross-referencing is hard to provide when authoring HTML documents directly as all possible link destinations must be known in order for the appropriate links to be encoded. This behaviour is enhanced if the co-authors agree an appropriate vocabulary for use when linking. These results have been verified during initial experiments with a small group of authors.

4.3. Limitations

The disparity between the linking models of Microcosm and WWW can highlight some limitations that are not yet addressed by the conversion tool. This section briefly outlines some of these limitations.

4.3.1. End points

In WWW, links to an offset in a document, rather than the whole document, require that the destination be denoted by HTML mark-up. Attempting to incorporate such mark-up into the results of the conversion tool greatly increases the complexity of the process. Rather than treating each document individually, the entire set of documents must be processed before creating the HTML, so that all link sources and destinations can be determined. This also means that if a single document is edited or the link database changed, the entire document set may need to be recompiled in case new destination points must be inserted in existing documents. This restricts the Microcosm facilities that may be used, but is of less consequence for the WWW version of the documents, since the requirement to mark destinations means such links are not often used in other WWW documents.
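
The two-pass processing that destination mark-up would require can be sketched as follows. This Python fragment is an illustration under our own assumptions (the link record layout, the anchor naming scheme and the function names are hypothetical) and does not describe a feature of the current conversion tool. HTML escaping of the text is omitted for brevity.

```python
from collections import defaultdict


def collect_destinations(links):
    """Pass 1: gather, per document, every offset that some link points at."""
    anchors = defaultdict(set)
    for link in links:
        if link["dest_offset"] is not None:
            anchors[link["dest_doc"]].add(link["dest_offset"])
    return anchors


def insert_anchors(text, offsets):
    """Pass 2: insert a named anchor at each destination offset."""
    pieces = []
    cursor = 0
    for offset in sorted(offsets):
        pieces.append(text[cursor:offset])
        pieces.append('<a name="pos{}"></a>'.format(offset))
        cursor = offset
    pieces.append(text[cursor:])
    return "".join(pieces)


links = [
    {"dest_doc": "b.txt", "dest_offset": 6},      # link to an offset
    {"dest_doc": "c.txt", "dest_offset": None},   # link to a whole document
]
anchors = collect_destinations(links)
marked = insert_anchors("Hello world", anchors["b.txt"])
```

The first pass must examine every link before any document can be written, which is why a change to one document or to the link database may affect the output for others.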

4.3.2. Document Formatting and Source Formats

The other main problem is the limitation of using text as a source format. This makes the conversion process easier, but restricts the amount of structure information that can be provided in the resultant HTML document. Although a degree of structure can be indicated within the text, it is difficult to interpret this in the conversion process. If structure is necessary, it is possible to enter it directly in the source documents. This is then directly copied to the HTML documents by the conversion tool. An alternative approach would be to use a structured document format that Microcosm can handle, such as rich text format (RTF). However, this increases the processing required of the converter.

5. FURTHER DEVELOPMENTS

We are continuing to develop the tools described in this paper. In particular, we have recently begun a small trial of group authoring of WWW material using Microcosm and the mcm2html converter. The initial results of this project have been encouraging. The development of a Microcosm-style architecture within the WWW environment is also ongoing. There are a number of areas which we hope to investigate as these tools develop, which are described briefly below.

To improve the WWW authoring facilities offered by Microcosm, an HTML editor could be developed which works in conjunction with Microcosm. The editor could provide all structure management for the source documents, whilst using Microcosm to provide efficient link management. This would overcome the problems with limited structure in text-based source documents.

One possible extension of the mcm2html converter is to allow it to work in real time. This means that it could act as a gateway between an active Microcosm system and WWW clients. Thus the WWW view of the available documents would always be up to date with the current Microcosm version. This is not the case at present, as the WWW view must be actively 'compiled' from the Microcosm 'source'. Another benefit of this approach is that the use of alternative link structures could be enhanced. By providing some form of interface to the underlying link service offered by Microcosm, WWW users would be able to adjust the links being offered to them to suit their current requirements. For example, a user unfamiliar with the subject matter might use a dictionary linkbase, whilst an 'expert' would not need this facility and could turn it off.

With the ability to provide full WWW access to an underlying link service in this way, it would then be possible to develop full collaboration between the Microcosm and WWW environments. With Microcosm operating as a link service in conjunction with a WWW server, and the Microcosm-style user-interface facilities for WWW as described in section 3.3, a fully inter-operative environment could be created. Links could then be authored and followed from any WWW document into material stored and managed by Microcosm.

6. CONCLUSIONS

This paper has described the range of possibilities for integration of Microcosm and WWW, and shown how the authoring facilities and architecture of Microcosm may be of benefit when developing WWW documents. By separating the management of links from documents, the problems of editing and updating hypertext documents are reduced. The facilities of Microcosm are particularly useful when group authoring of WWW material is being undertaken.

In addition, by providing facilities for a more exploratory, reader-led navigation of WWW material, the user is able to browse the available information in any way appropriate to their current needs, rather than the path chosen by the author of the material. It also allows WWW material to be augmented by the flexible and configurable hypertext services that Microcosm offers.

The integration of Microcosm with WWW in this way allows the normally closed environment of WWW to incorporate features desired of open systems. This has benefits for both the author and user of WWW documents.

7. REFERENCES

[Berners-Lee 92] T. Berners-Lee, R. Cailliau, J.-F. Groff, B. Pollermann, "World Wide Web: An Information Infrastructure for High Energy Physics", in Proceedings of the Workshop on Software Engineering, Artificial Intelligence and Expert Systems for High Energy and Nuclear Physics.

[Davis 92] H. Davis, W. Hall, I. Heath, G. Hill, R. Wilkins, "Towards an Integrated Information Environment with Open Hypermedia Systems", in ECHT '92, Proceedings of the Fourth ACM Conference on Hypertext, Milan, Italy, November 30-December 4, 1992, ACM Press, 181-190.

[Davis 94] H. Davis, S. Knight, W. Hall, "Light Hypermedia Link Services: A Study of Third Party Application Integration", in Proceedings of the Sixth ACM Conference on Hypertext, Edinburgh, Scotland, September 1994, ACM Press, 41-50.

[De Roure 94] D. De Roure, G. Hill, W. Hall, L. Carr, "A Scalable, Distributed Multimedia Information Environment", to be published in Proceedings of Mediacomm '95.

[Fountain 90] A. Fountain, W. Hall, I. Heath, H. Davis, "Microcosm: an Open Model With Dynamic Linking", in Hypertext: Concepts, Systems and Applications. Proceedings of the European Conference on Hypertext, INRIA, France, November 1990, 298-311.

[Hill 93] G. Hill, R. Wilkins, W. Hall, "Open and Reconfigurable Hypermedia Systems: A Filter Based Model", Hypermedia, 5(2), 1993.

[Hill 94] G. Hill, W. Hall, "Extending the Microcosm model to a Distributed Environment", in Proceedings of the Sixth ACM Conference on Hypertext, Edinburgh, Scotland, September 1994, ACM Press, 32-40.

ACKNOWLEDGEMENTS

This work has been carried out with support from EPSRC grant number GR/K36409.