Over the past few years the number of people accessing the Internet, and the quantity and variety of resources available through this medium, have increased dramatically. To enable easier access to these information stores, systems have been developed that partially automate the location and retrieval of any required part of this data reserve. At present these utilities can be used in conjunction with existing hypermedia systems only as peripheral parts rather than as integrated components. This paper discusses these systems, investigates methods by which they can be used, and considers how they might increase the effectiveness of hypermedia systems, such as Microcosm, if they were made an integral part of such software environments.
As a result of this increased utilisation the volume of information available through this world-wide network has now reached hundreds of terabytes; the American Library of Congress alone holds approximately twenty-five terabytes in its archives [Stein, R.M. (1991)].
The time taken for a user to browse such vast tracts of data would be unacceptable to all but the most foolhardy, so services are being developed that give Internet users a simple interface with which to locate and retrieve the resources they require.
Whilst the Internet has been burgeoning there has been extensive interest in the fields of hypertext and hypermedia, and although it is not a recent idea, the area where these two developing technologies meet is certainly an exciting one. An early exponent of the use of hypertext together with data storage and retrieval techniques was Nelson with project Xanadu [Nelson, T.H. (1988)]. Other attempts include KMS, based on the ZOG system developed at Carnegie-Mellon [Akscyn, R.M., McCracken, D.L., Yoder, E.A. (1988)], and Intermedia, developed at Brown University's Institute for Research in Information and Scholarship (IRIS) [Yankelovich, N., Haan, B.J., Meyrowitz, N.K., Drucker, S.M. (1988)]. The main difficulty with these systems is that they used a certain amount of 'mark-up' within the documents, so the original integrity of the document was lost. Although Intermedia held the links separately, the documents were still marked to indicate link positioning. Systems such as Microcosm hold the links entirely separately, so there is no alteration to the document; this allows links to be placed in documents to which the author has read-only access.
At present Microcosm can be used with existing Internet resource discovery systems by adding them as viewers to the system [Hill, G., Wilkins, R., Hall, W. (1992)]. This scenario makes it difficult for the user to link the information being accessed in the hypermedia environment with the Internet resource bases being queried; for example, a user could not select a piece of text within the hypermedia environment and automatically query an Internet resource. Such resource discovery systems therefore need to become an inherent part of any hypertext system that is to be of more than cursory usefulness within the wider context of the Internet.
Figure 1 : The logical structure of Alex.
Figure 2 : The basic Archie architecture.
The basic Archie architecture is shown in figure 2. As can be seen from the diagram there is more than one Archie server: currently there are thirteen replicated servers around the world, and users can choose the site that is geographically closest to them. The Archie database can be accessed in three ways: telnet, e-mail and, more recently, the Prospero interface (Prospero is covered in more depth in section 2.5). The telnet interface has been found to be rather intensive on the server's resources, so the other two methods are preferable from the point of view of the site at which the server is located. To maintain consistency between the databases strewn across the world there is a central database in Montreal, Canada, which regularly checks the FTP sites; the other sites update their databases from this master. It has been estimated that fifty percent of all Internet traffic to and from Montreal is directly related to the Archie update mechanism.
Figure 3 : Gopher.
As mentioned previously the Gopher service is based upon a hierarchy of information, and the root of this tree is stored at the University of Minnesota on the host rawBits.micro.umn.edu. This is the default directory retrieved by a Gopher client when first invoked. It is possible, however, to alter this default directory to one that is more applicable to the user's requirements. For example, it would not be sensible to set the default directory to one stored on a machine in New Zealand if the user were located in Southampton; it would be far better to use the directory stored on the host gopher.ed.ac.uk.
The Gopher architecture allows for a hierarchy of servers, so that there could be a top-level server for an organisation and then various lower-level servers for the departments within it. This allows the user to gradually hone the search until the required resource is located. The service is available in two forms. The first is a series of menus through which the user navigates, picking entries of interest so that the Gopher client can retrieve the next level of the menu structure until eventually the information is found. The second is a full-text search implemented by special Gopher search servers which hold full-text inverted indices of subsets of the documents stored on Gopher servers. A Gopher search server can be set up to index more than one normal Gopher server, so that any particular logical area, i.e. field of interest, can be covered by one search server even though the documents may not all be located on any one server. Recent Gopher clients also allow access to information stored on WAIS, Archie and FTP servers as well as on Gopher servers.
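The menu interaction described above rests on a very simple wire protocol, which can be sketched as follows. This sketch is illustrative rather than part of any Gopher distribution: the client opens a TCP connection to port 70, sends a selector string terminated by CRLF, and parses the tab-separated menu lines that come back. The host names and menu line used here are examples only.

```python
import socket

# A minimal sketch of the Gopher protocol. Menu lines are
# tab-separated: a one-character type code prefixed to the display
# string, then the selector, host and port of the item.

GOPHER_PORT = 70

def parse_menu_line(line):
    """Split one Gopher menu line into its four fields.

    The first character is the item type: '0' = text file,
    '1' = directory (sub-menu), '7' = full-text search server.
    """
    display, selector, host, port = line.split("\t")
    return {
        "type": display[0],
        "display": display[1:],
        "selector": selector,
        "host": host,
        "port": int(port),
    }

def fetch(host, selector="", port=GOPHER_PORT):
    """Send a selector and return the raw reply (needs network access)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(selector.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")

# An example menu line, as a server might return it:
item = parse_menu_line(
    "1All the Gopher Servers in the World\t1/world\tgopher.ed.ac.uk\t70")
```

Following a menu entry is then just another call to `fetch` with the selector, host and port taken from the parsed line, which is how a client descends the hierarchy one level at a time.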
* Provide users with a uniform, easy-to-use, location transparent mechanism to access information.
* Allow a user at a workstation to catalogue and view information from a large number of sources. [Kahle, B. (1989)]
The WAIS model is based on the typical client-server design and is shown diagrammatically in figure 4.
Figure 4 : WAIS Client-Server Design.
Each server keeps a complete inverted index of all the documents within its database and hence can use a full-text retrieval system when a query is lodged with the server. The server responds with a set of the relevant documents, selected from the database using a word-weighting algorithm to find the best matches. The set can also contain the names of other servers that have registered with the server being queried, although this is unlikely unless the query was directed at the directory-of-servers server, with which all WAIS servers must be registered if they are to be publicly accessible. WAIS can therefore be seen as a set of decentralised indices, all accessed transparently to the user.
The client application displays the set of matched documents, which may be in any format (e.g. PostScript, text, graphics, animations). The user selects the required document(s) to be retrieved from the database for display. If a particular document proves especially interesting the user can utilise a feature called relevance feedback: the user selects a document, or a section of a document, and re-runs the query so that other documents similar to the one selected are also returned in the set. This selection process ranks the documents in terms of the number of words they have in common.
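The indexing, ranking and relevance-feedback behaviour described above can be illustrated with a small sketch. This is not the WAIS implementation: the tf-idf weighting below is a simplified stand-in for the word-weighting algorithm mentioned earlier, and relevance feedback is modelled simply as widening the query with the words of the selected document.

```python
from collections import Counter, defaultdict
import math
import re

def tokenize(text):
    """Reduce text to a list of lower-case words."""
    return re.findall(r"[a-z]+", text.lower())

class InvertedIndex:
    """A toy WAIS-style server: inverted index plus ranked retrieval."""

    def __init__(self):
        self.postings = defaultdict(Counter)  # word -> {doc_id: term count}
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for word in tokenize(text):
            self.postings[word][doc_id] += 1

    def search(self, query):
        """Rank documents by a simple tf-idf word weighting."""
        n_docs = len(self.docs)
        scores = Counter()
        for word in tokenize(query):
            docs_with = self.postings.get(word, {})
            if not docs_with:
                continue
            # Rarer words carry more weight across the database.
            idf = math.log(n_docs / len(docs_with))
            for doc_id, tf in docs_with.items():
                scores[doc_id] += tf * idf
        return [doc_id for doc_id, _ in scores.most_common()]

    def relevance_feedback(self, query, liked_doc_id):
        """Re-run the query widened with the words of a relevant document."""
        return self.search(query + " " + self.docs[liked_doc_id])
```

For example, indexing three short documents and querying for "hypermedia links" ranks the document containing both words first, and feeding a chosen document back in pulls similar documents into the result set.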
The protocol used for communication between the client and server(s) is an extension of the NISO Z39.50 protocol [Lynch, C. (1991)], so that other services wishing to communicate with WAIS servers have a standard to which they can conform to ensure compatibility.
In the brief time that the WAIS project has been running it has already proved to be quite a success: as of June 1992 there were over 225 publicly registered databases, each with a "specialised subject", and over 6000 hosts with an estimated 10,000 users accessing those servers.
"The World Wide Web initiative encourages physicists to share information using wide-area networks." [Berners-Lee, T.J., Cailliau, R., Groff, J.-F., Pollermann, B., (1992a)]
The Web allows 'pages' of information to be displayed, and within these pages there are hypertext links to other pages within the system. The documents at these end points need not be on the same server as the document from which the link originated, but this is all transparent to the user. New pages can be added to the system and links then made from existing pages that are relevant to the new addition; links can also be made from the new page to existing documents. This means that users can browse through this environment following any links that they find interesting, possibly finding that new links have been added since their last visit to a particular document.
Thus this model merges the techniques of information discovery on the Internet with those of hypertext. The user has no need, and in most cases no wish, to know the underlying mechanics when a link is followed, or where the information is coming from; they are interested in the content of the information. It can therefore be said that the World Wide Web organises the information available via the Internet into a distributed hypertext model, with a client application running on the user's machine and various servers around the globe providing the information required.
When the original idea of the World Wide Web was being considered it was decided that a purely hypertext-based system would not be flexible enough for all the tasks that would be undertaken, since in quite a few instances it would not be obvious which hypertext links to follow to find particular information. To this end the system was designed and built with two separate discovery models available:
* one based on the hypertext paradigm of following links from highlighted sections of text.
* the other based upon the flat search paradigm for accessing indices in the information space.
The benefit of adopting both these approaches is that it gives the World Wide Web user access to other Internet resources that cannot easily be formatted into hypertext form, such as Gopher servers, WAIS databases, Network News groups and anonymous FTP sites, as well as the World Wide Web servers. This, together with the architecture of the World Wide Web, is shown graphically in figure 5.
Figure 5 : World Wide Web Architecture.
When a client application is first installed a default cover page can be specified, which will be retrieved and displayed whenever the application is started. There is a standard front page available on the CERN server which gives access to the three discovery trees currently supported by the World Wide Web:
* Classification by subject/server type
* High-energy physics (as the field that the World Wide Web was originally set up to support, it features prominently in the information stored on the system, especially on the CERN server)
* Classification by organisation
To allow links to be embedded within the documents accessible by the World Wide Web a form of SGML (ISO 8879:1986) is used, called Hypertext Markup Language, or HTML. Markup is used to indicate the position of a link in the document and also the page to which it is linked. The end point of the link is specified using a Unique Resource Locator (URL); these are discussed later in this paper. If users wish to follow a link they simply click with a mouse button on the area of highlighted text. The document at the end of the link is then retrieved using the Hypertext Transfer Protocol (HTTP); a new protocol was devised because no existing protocol offered World Wide Web servers the necessary features with adequate performance for following hypertext links.
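The link-following mechanism just described can be sketched in outline: an HTML-aware client must find each anchor in a retrieved page and record both the URL and the highlighted text, so that a click on that text can trigger the HTTP retrieval of the end-point document. The sketch below uses a present-day parser library and an invented sample page; it is not the CERN browser code.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (url, anchor text) pairs from <A HREF=...> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased regardless of the page's casing.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Accumulate the highlighted text between <A> and </A>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# An invented page in the flavour of early HTML:
page = """<TITLE>Front page</TITLE>
<P>See the <A HREF="http://info.cern.ch/physics.html">physics index</A>
or <A HREF="subjects.html">browse by subject</A>."""

extractor = LinkExtractor()
extractor.feed(page)
# extractor.links now pairs each end-point URL with its highlighted text.
```

A client would highlight each stretch of anchor text on screen and, on a mouse click, pass the corresponding URL to its retrieval layer to fetch the end-point document over HTTP.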
On the whole the idea that "one view encompasses all systems" [Berners-Lee, T.J., Cailliau, R., Groff, J.-F., (1992b)] seems to have been reasonably successful.
The Internet resource discovery systems on which the rest of this paper will concentrate are the WWW and WAIS. The techniques discussed here are, however, applicable to all of the discovery systems covered previously, but not to systems, such as Prospero, that organise the user's view of the Internet; these would have to be implemented at an operating system level, so the hypermedia system would use them directly and automatically.
At the University of Southampton, in the Image and Media Lab., an open hypermedia system called Microcosm [Fountain, A., Hall, W., Heath, I., Davis, H. (1990)], [Davis, H., Hall, W., Heath, I., Hill, G., Wilkins, R. (1992)] has been developed, and it is this system that will be considered in the remainder of this paper.
If any of the previously mentioned discovery systems were added to Microcosm in its "raw" state it would have to be as a viewer, because a filter must be able to accept Microcosm messages, act upon them if need be, and then pass them on to the next filter in the chain. None of the Internet systems has any degree of tailorability, so it would not be possible for them either to accept an incoming message or to send the message on to the next filter.
The different Microcosm viewers fall into one of three categories of Microcosm "awareness". Specially written viewers, such as the text viewer, are fully aware, so they can interact with the rest of the Microcosm system on all levels. The next tier down comprises the partially aware viewers. These are usually mainstream applications that include some degree of programmability, so they can be altered to understand some of the Microcosm messages and to interact with the system to some extent. The lowest level is that of unaware viewers. For example, Windows Notepad cannot be altered at all to use the standard Microcosm messages, but it can be started by Microcosm with a specific document; to pass information out to the hypermedia system, Notepad must rely upon Microcosm monitoring the clipboard for any changes, whereupon the appropriate action can be taken. It is into this last group that all the existing resource discovery systems fall. Because Microcosm supports external applications it is a reasonably simple task to use programs that are unaware: the author could make a link from a piece of text, or an area of a bitmap, to the discovery system so that when the hypermedia browser follows the link the discovery system is started.
As mentioned earlier there is no possibility of two-way communication between Microcosm and the Internet resource discovery system, so the resources thus discovered would not be directly available to the hypermedia application. They would have to be saved using the discovery system and then imported into Microcosm, which makes the whole operation rather circuitous. Another problem with this approach is the lack of a common interface between the hypermedia system and the discovery system, making it all too easy for the browser or author to become confused between the two. It would be much better if the two systems were properly integrated.
This raises some interesting problems :
* How to locate the resource?
* How to retrieve the documents?
* How to display the documents?
The technical method of solving each of the above problems is covered by the various protocols. The main question is where the different operations should be implemented within Microcosm. As intimated in the previous paragraphs, a new filter would have to be written to locate suitable resources that might hold relevant documents; a first attempt at such a filter, based upon the WAIS discovery methodology, is currently in progress. Once documents have been found they need to be retrieved from the remote server for display on the local machine. The logical place for the document retrieval functionality would be the Document Management System (DMS) portion of Microcosm, which could perform the necessary transportation tasks to make a copy of the document on the local machine. Once this had been completed a message would be dispatched to the appropriate viewer indicating the new document to be displayed. In most cases the documents are purely textual, so the standard Microcosm text viewer could be used; but if the system were widened to access other systems such as the World Wide Web and Gopher, new viewers would need to be written to cope with their specialised document structures and layouts.
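The division of labour proposed above can be sketched with invented names (the message fields, the "FOLLOW.LINK" action and the WAIS-style index below are illustrative stand-ins, not Microcosm's actual message set): a discovery filter inspects each message passed down the chain and attaches the best-matching remote document, after which a stand-in document manager makes the local copy that a viewer would then display.

```python
# A sketch of a Microcosm-style filter chain: each filter receives a
# message, may act upon it, and passes it on to the next filter.

class Filter:
    def __init__(self, next_filter=None):
        self.next = next_filter

    def handle(self, message):
        message = self.process(message)
        if self.next:
            self.next.handle(message)

    def process(self, message):
        return message  # default: pass the message through unchanged

class DiscoveryFilter(Filter):
    """On an unresolved link, query a (stand-in) remote index."""

    def __init__(self, index, next_filter=None):
        super().__init__(next_filter)
        self.index = index  # selection text -> candidate documents

    def process(self, message):
        if message.get("action") == "FOLLOW.LINK" and "document" not in message:
            hits = self.index.get(message["selection"], [])
            if hits:
                message["document"] = hits[0]  # best match from the index
        return message

class DocumentManager(Filter):
    """Stand-in DMS: 'retrieves' the document and notes the local copy."""

    def process(self, message):
        if "document" in message:
            message["local_copy"] = "cache/" + message["document"]
        return message

# Wire the chain: discovery first, then document management.
index = {"hypermedia": ["hypermedia-survey.txt"]}
chain = DiscoveryFilter(index, next_filter=DocumentManager())
msg = {"action": "FOLLOW.LINK", "selection": "hypermedia"}
chain.handle(msg)
# msg now carries both the discovered document and its local copy.
```

The point of the sketch is the ordering: because the discovery filter sits upstream of the document manager, a selection with no authored link can still resolve to a remotely discovered document before the DMS and viewer take over.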
If the World Wide Web system were to be fully integrated into Microcosm then not only would a new viewer have to be written to cope with the HTML format that World Wide Web documents use, but that viewer would also have to extract the linking information contained within these documents and pass it on to the DMS, which would retrieve the document specified as the end point of the link.
The extensible nature of Microcosm allows new ideas such as these to be seamlessly integrated into the system so that the author/browser can interact with the system in the usual manner with no knowledge that the document may be coming from a distant server or from the local hard disc. Existing systems require a plethora of applications to discover new resources, link them into the hypermedia application and browse them. Integrating resource discovery into Microcosm gives the user a consistent interface with which to work, hence lowering the cognitive overheads imposed by many different applications. The user can devote more intellectual effort to the content of the hypermedia application so increasing productivity and the applicability of Microcosm to all fields.
Another benefit of such a strategy is that it would allow the browser more flexibility in exploring the subject area of the hypermedia application. If a new aspect of the subject occurred to users as they were browsing the system, related documents could be located and built into the user's personal view of the system for future reference, even if the original author had not thought to explore that particular avenue.
The SERC-funded SuperJANET project promises a pervasive network between institutions that can deliver data at speeds in the range 10Mbit/s to 155Mbit/s. When the network is in place the long-vaunted promise of digital video and sound delivered over networks will be truly realisable. The scope of SuperJANET will not be as far-reaching as that of the Internet, but it will still allow UK institutions to interchange and access remote hypermedia applications in reasonable time frames. The quantity and diversity of resources available to the author/user will blossom, making the task of locating the required documents even more troublesome than at present.
Projects such as WWW and WAIS have shown that discovery systems are a valuable addition to the tools available to the user, as the number of people choosing to use them indicates. If a unified system could be produced under which many of the different methods could seamlessly operate, their popularity would increase dramatically.
Also connected to the ideas outlined in this paper is the possibility of controlling a Microcosm session remotely. This would be a particularly useful aid for tutors, enabling them to demonstrate to students particular aspects of an application that they feel are important. The first version will be written to work over a local area network, but eventually it should be possible to alter the software so that the remote machines can be located anywhere with a network connection.
This paper has presented ideas for the integration of these services with open hypermedia systems, such as Microcosm, so that industrial-strength hypermedia systems can be created that utilise the entire gamut of resources available via the world's networks. This will enable a richer environment in which to build hypermedia applications and also enhance hypermedia's applicability to more areas of knowledge.
With the speed of data transmission over the global network ever increasing, it will soon be possible to hold a central store of digital video, sound, etc. and deliver it on request in real time over the network, although the bandwidth required would be rather high. Collaboration on a massive scale will become possible with such networks, allowing for a much broader base of available applications. It is imperative, therefore, that discovery tools such as those mentioned in the body of this paper be incorporated into hypermedia systems as soon as possible, allowing users to concentrate on the more important, and interesting, task of creating the application as opposed to finding the material with which to construct it.
It would, however, be wrong to suggest that Internet resource discovery systems are a panacea for the difficulties of locating resources. It is still extremely difficult to locate suitable diagrams for a particular topic because there is no universally accepted classification system for pictures. Research is continuing in these areas, so in the not too distant future the automatic location of pictures and digital video should also be possible, allowing truly global hypermedia applications to be produced.
Alberti, R., Anklesaria, F., Lindner, P., McCahill, M., Torrey, D., (1992), "The Internet Gopher protocol : a distributed document search and retrieval protocol", On-line documentation, Spring
Berners-Lee, T.J., Cailliau, R., Groff, J.-F., Pollermann, B., (1992a), "World Wide Web : An Information Infrastructure for High-Energy Physics", Proceedings International Workshop on Software Engineering and Artificial Intelligence for High Energy Physics, La Londe, France.
Berners-Lee, T.J., Cailliau, R., Groff, J.-F., (1992b), "The World Wide Web", Computer Networks and ISDN Systems, Vol. 24, No. 4-5, pp. 454-459.
Berners-Lee, T.J., (1993), "Unique Resource Locators", Internet Draft, IETF URL Working Group, Expires September 30, 1993.
Cate, V., (1992), "Alex - A Global Filesystem", Proceedings of the Usenix File Systems Workshop, pp. 1-11.
Danzig, P.B., Ahn, J., Noll, J., Obraczka, K., (1991), "Distributed Indexing : A Scalable mechanism for Distributed Information Retrieval", Proceedings of the 14th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, October, pp. 220-229.
Danzig, P.B., Li, S.-H., Obraczka, K., (1992), "Distributed Indexing of Autonomous Internet Services", Journal of Computer Systems, Vol. 5, No. 4.
Davis, H., Hall, W., Heath, I., Hill, G., Wilkins, R., (1992), "Microcosm : An Open Hypermedia Environment for Information Integration", ECHT '92, Milan, December, pp. 181-190.
Emtage, A., Deutsch, P., (1992), "Archie - An electronic Directory Service for the Internet", Proceedings USENIX Winter Conference, January, pp. 93-110.
Fountain, A., Hall, W., Heath, I., Davis, H., (1990), "MICROCOSM : An Open Model for Hypermedia with Dynamic Linking", Hypertext : Concepts, Systems and Applications. The Proceedings of The European Conference on Hypertext, INRIA, France, November.
Hill, G., Wilkins, R., Hall, W., (1992), "Open and Reconfigurable Hypermedia Systems : A Filter-Based Model", Computer Science Technical Report CSTR 92-12, University of Southampton, UK.
Howard, J., Kazar, M., Menees, S., Nichols, D., Satyanarayanan, M., Sidebotham, R., West, M., (1987), "Scale and Performance in a Distributed File System", ACM Transactions on Computer Systems, Vol. 6, No. 1, Jan., pp. 51-81.
Kahle, B., (1989), "Wide Area Information Server Concepts", Thinking Machines Technical Memo DR89-1, Cambridge, MA : Thinking Machines Corp.
Li, Z., Hall, W., Davis, H., (1992), "Hypermedia links and information retrieval", Proceedings of the 14th British Computer Society Research.
Lynch, C., (1991), "The Z39.50 Information Retrieval Protocol : An Overview and Status Report", Computer Communication Review, ACM SIGCOMM, Vol. 21, No. 1, pp. 58-70.
Nelson, T.H., (1988), "Managing Immense Storage", Byte, Vol. 13, No. 1, pp. 225-238.
Neuman, B.C., (1992), "Prospero : A Tool for Organising Internet Resources", Electronic Networking : Research, Applications, and policy, Vol. 2 No. 1, pp. 30-37.
Schwartz, M.F., Emtage, A., Kahle, B., Neuman, B.C., (1992), "A Comparison of Internet Resource Discovery Approaches", Computing Systems, Vol. 5, No. 4
Stein, R.M., (1991), "Browsing through Terabytes", Byte, Vol. 16, No. 5, May, pp. 157-164.
Yankelovich, N., Haan, B.J., Meyrowitz, N.K., Drucker, S.M., (1988), "Intermedia : The Concept and Construction of a Seamless Information Environment", Computer, Vol. 21, No. 1, Jan., pp. 81-96.