Using the World Wide Web as an Electronic Library

Les Carr, Hugh Davis, Wendy Hall and Jessie Hey,
Multimedia Research Group,
University of Southampton


It is often supposed that an electronic library is synonymous with a database: a closed world in which access is strictly limited to the materials inside it. The World-Wide Web, by contrast, is an open environment with many participating sites but no enforced control, and therefore no guarantees of data permanence or consistency. In this paper we compare the advantages of the open- and closed-world views in terms of two systems which embody them, and describe an Electronic Libraries project which takes advantage of the open perspective.

1. What is an Electronic Library?

A library is commonly thought of as a mainly closed system: books enter it through an admissions system which records them and classifies them for cataloguing and borrowing. Criminal activity and accidents notwithstanding, each book is accounted for and can be located easily. People do not sneak into libraries and add books, nor are they allowed to reshelve them or withdraw them from stock. However, all these things are normal on the World Wide Web (documents appear and disappear ad libitum) since it provides universal distribution, but no central control.

In [Levy & Marshall], the assumption that a digital library is the repository of a fixed and permanent document collection is challenged. They argue that if the technology can accommodate fluid, revisable and even open-ended, continuously-authored documents, then these documents should surely find a home in a digital library. Also, from an archivist's point of view, semi-permanent and even ephemeral documents should be made accessible. This kind of argument suggests a more open and dynamic environment.

In this paper we compare and contrast two approaches to information management systems as exemplified by two specific systems that, together with the WWW, could be useful for implementing some aspects of a digital library. These two approaches can be loosely summed up as 'open' and 'closed'. We go on to describe the "Open Journal Framework", a UK ELib project whose aim is to enhance the functionality of libraries of electronic publications by exploiting an open style of approach.

2. The Closed System approach

If a library is predominantly a closed system, then a database is well-suited to modelling it in electronic form. Databases are transaction-based, multi-user systems which provide guarantees of data integrity and consistency of access. Hyper-G [Flohr, Schmaranz] (a research project from the University of Graz, Austria, and now a commercial product under the name HyperWave) is a hypertext environment which is implemented as a strictly controlled database of documents and links.

One of the main characteristics of Hyper-G which makes it a candidate for Electronic Library services on the Web is its guarantee of consistency: its undertaking to keep strict track of all documents and interdocument hypertext links which it handles.

Hyper-G has a superficially similar architecture to the Web: client browsers are supplied with documents by network servers, but unlike the Web, the hypertext links (relationships between the documents) are stored independently. Hyper-G moves one step on from the Web by adding support for link maintenance and management, linking between different media types, different sets of links for different users, a docuverse, text retrieval and some visualisation tools for navigating around 'clusters' of related materials.

Each Hyper-G server maintains a document management system, which keeps the attributes of the documents on the server, a link database which maintains the links, and an information retrieval engine, which can retrieve on both the attributes of the document and also the full text content of the document. The servers themselves may be arranged into hierarchies underneath a world wide 'root' server, but the user connects directly to only one server. Hyper-G can also arrange to collect documents from other servers such as Web and Gopher servers.

The Hyper-G client browsers provide an interface for document and catalogue browsing, authoring and link creation, supporting a variety of standard text, picture, movie and 3D data formats.

Both within documents and between documents, hypertext integrity is maintained by the authoring clients. Each document knows the ids of all the links it uses, and even though the links are stored externally, a client loading a document is also able to load all the links it requires. The client can then edit the document (or move it or delete it) without causing integrity problems, since at the client end all links are effectively embedded within the document.
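The arrangement described above can be sketched in a few lines. This is an illustrative model only, not the Hyper-G implementation: documents carry the ids of the links they use, while the link records themselves live in an external store that a client consults when it loads a document.

```python
# Sketch (not Hyper-G's actual data model): documents reference links by id,
# while the link records themselves live in an external link store.

class LinkStore:
    """External link database: link id -> (source doc id, destination doc id)."""
    def __init__(self):
        self.links = {}

    def add(self, link_id, src_doc, dst_doc):
        self.links[link_id] = (src_doc, dst_doc)

    def links_for(self, doc_id):
        # A client loading a document also loads every link the document uses,
        # so client-side edits cannot silently orphan a link.
        return {lid: ends for lid, ends in self.links.items() if doc_id in ends}

class Document:
    def __init__(self, doc_id, link_ids):
        self.doc_id = doc_id
        self.link_ids = link_ids  # each document knows the ids of its links

store = LinkStore()
store.add("L1", "paper.html", "refs.html")
doc = Document("paper.html", ["L1"])
print(store.links_for(doc.doc_id))  # links fetched alongside the document
```

Because links are addressed by id rather than by document location, moving or editing a document leaves the link records intact, which is the basis of the integrity guarantee.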

3. The Open System approach

One of the principal perceived advantages of Hyper-G over the 'raw' WWW is this property of continued consistency with a dynamic and changing set of data. Recent discussions in the Electronic Library community [lis-elib] have focussed on this as a requirement for maintaining an evolving collection of journal resources, and on the advantages of Hyper-G for handling such data. However, there is one fundamental problem with the 'closed world' approach: how can you manage links to items outside your closed world? In a truly self-contained world there would be no interaction with the 'outside', but electronic journals are a medium for scientific and technical discussion, the domain of which is "the literature". Inevitably "the literature" will consist of many more documents than are contained in any one "closed world". This is the revolution of the WWW itself: access to any document anywhere in the world.

To accommodate this using a closed system would require either that all the literature be imported into the system (converted into the required format, with hypertext links added), or that links to the 'outside' simply be ignored. The former approach is likely to be very expensive, whereas the latter denies the advantage of link management to the majority of the links.

In this section we briefly describe Microcosm (a research system developed at the University of Southampton and now a commercial product) as used for managing local document resources and its successor, the DLS, a system which is being used for managing distributed information resources.

Microcosm: an open system for local information

By contrast with Hyper-G, the characteristic of Microcosm [Davis et al] which makes it a candidate for Electronic Library services is its ability to manage heterogeneous information in an open (i.e. unconstrained) information environment.

Microcosm's fundamental model is of a group of co-operating processes, communicating via message passing, which together supply the various facilities of an information environment. Its main features are:

* a selection-action paradigm for user interaction. Fixed link anchors (or buttons) are simply an author's predefined binding of a particular selection within a document to a particular hypertext action (such as follow link). In general, readers of a Microcosm hypertext can invoke a range of hypertext actions on arbitrary selections.

* links held externally to the documents they reference. This allows links to be made between the native documents of third-party applications, such as wordprocessors, spreadsheets, databases or CAD packages.

* a message passing framework, into which various document viewers or hypertext servers may be slotted.

* a document manager which associates document ids with document locations and a set of other attributes (such as title, author, keywords, description).

In order to see how the components of Microcosm function together, consider how a link is followed. The user makes a selection in an open document in a word processor, and then chooses the menu action "Follow Link". The application packages the selection, its position within the document and the document's identifier into a message which is sent through the system. A link database intercepts the message, looks up any links that correspond to that selection, and returns a message containing a specification of those links, along with the original link request message (possibly to be intercepted by further link databases). Eventually, all the link specification messages are intercepted by a dispatcher, which presents the user with a dialog box containing descriptions of each of the applicable links. The user selects a link and the dispatcher sends a "Dispatch Link" message to the appropriate viewer. The viewer intercepts the message, opens the appropriate document and highlights the destination selection.
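The message flow above can be sketched as a chain of filters. The function names and message fields here are invented for illustration and are not Microcosm's actual API: a Follow Link message passes through each link database in turn, accumulating matching links, and would finally reach a dispatcher that offers the matches to the user.

```python
# Minimal sketch of a Microcosm-style filter chain (names are illustrative):
# each link database intercepts the message, appends any matching links,
# and passes the message on to the next filter in the chain.

def make_linkbase(links):
    """links: dict mapping a selection string to a list of destinations."""
    def filter_fn(message):
        for dest in links.get(message["selection"], []):
            message["matches"].append(dest)
        return message  # pass the message on, possibly to further linkbases
    return filter_fn

def follow_link(selection, doc_id, chain):
    message = {"action": "follow-link", "selection": selection,
               "doc": doc_id, "matches": []}
    for linkbase in chain:       # each linkbase intercepts the message in turn
        message = linkbase(message)
    return message["matches"]    # a dispatcher would present these to the user

chain = [make_linkbase({"Hyper-G": ["hyperg-overview.txt"]}),
         make_linkbase({"Hyper-G": ["graz-server-notes.txt"]})]
print(follow_link("Hyper-G", "intro.doc", chain))
```

Note that the resolution is driven by the content of the selection, not by a link id embedded in the document, which is the point developed in the next paragraph.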

In this model, links are resolved on the basis of the content of the object that the user has selected on the screen. This can be a piece of text, part of an image, an object in a CAD diagram, or a map reference in a GIS system. An action such as "follow link" or "compute link" is then attached to this selection and that information is passed through the system. This is significantly different from the Hyper-G model, where links are requested by id rather than discovered by a dynamic process of computation. Also, the system architecture allows both in-house and third-party information processing tools to be incorporated into the system. A particular exploitation of this flexibility within an Electronic Library context is discussed in [Davis & Hey 1995].

DLS: an open system for global information

The DLS [Carr 95] is a development of the Microcosm philosophy (extensible links applied to documents in arbitrary applications) applied to a distributed environment (the World-Wide Web). In the same way that a client connects to a remote Web server to access a document, DLS allows the client to connect to a link server to request a set of links to apply to the data in a document. From an abstract viewpoint it provides a hypermedia link service which can be used alongside, but separate from, the WWW's document data service: in practice the link service is mediated by the WWW (in HTTP messages), and implemented by CGI processes located on Web servers.

Figure 1: A document viewer first requests a document and then some links

The provision of an independent link service is designed to allow any information environment to be augmented with hypermedia functionality, whether or not it provides link following facilities itself. The WWW, of course, has a well-established method for expressing links as attributes of its native document format, and so the link service provides a complementary set of links on top of those standard facilities. By contrast, a simple text editor (such as Windows Notepad) has no built-in hypertext links, and so the link service provides an otherwise non-existent service to such users. Without a link service, Web users can follow links from HTML documents or `imagemapped' pictures into dead-end media such as spreadsheets, CAD documents or text; with the link service they can also follow links out of these media again.

End-users (readers or browsers) may choose to subscribe to this service by running a small interface agent which communicates with both the link service and the document viewer. For an information consumer on the Web, the link service provides an additional means of navigation that can be tailored very precisely to his or her exact needs.

When the user wishes to investigate links from some information, they select the data of interest and choose the Follow Link menu item from the interface agent. The agent grabs the current selection, tries to determine the current document context (which document was the selection made in? what was its URL? where in the document was the selection located?) and parcels this information into a message which is sent to the link server. (This process actually consists of creating an HTTP message with POST data and sending it to a Web server, since the link service is hosted by the Web.)

The link server then responds with a set of links which are available from the specified selection in the specified document. These links are presented to the user in the form of a 'clickable' list of destinations, displayed as a page of HTML by the Web viewer.
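The server side of this exchange can be sketched as a small CGI-style handler. The field names 'selection' and 'url', and the link table itself, are assumptions made for this example rather than the DLS's actual protocol: the process receives the selection and document context as HTTP POST data and replies with an HTML page listing candidate destinations.

```python
# Sketch of a DLS-style link resolver on the server side (field names and
# linkbase contents are invented for illustration, not the real protocol).

from urllib.parse import parse_qs

# (selection text, document context) -> destinations; "*" matches any document
LINKBASE = {("microcosm", "*"): ["http://example.org/microcosm-overview"]}

def resolve(selection, doc_url):
    dests = []
    for (sel, ctx), targets in LINKBASE.items():
        if sel == selection.lower() and ctx in ("*", doc_url):
            dests.extend(targets)
    return dests

def handle_post(body):
    """Parse the POSTed form data and return an HTML page of destinations."""
    fields = parse_qs(body)
    sel = fields.get("selection", [""])[0]
    url = fields.get("url", [""])[0]
    items = "".join(f"<li><a href='{d}'>{d}</a></li>" for d in resolve(sel, url))
    return f"<html><body><ul>{items}</ul></body></html>"

print(handle_post("selection=Microcosm&url=http://example.org/paper.html"))
```

Returning the link list as ordinary HTML is what lets an unmodified Web viewer display the destinations as a 'clickable' page, as described above.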

Figure 2a: A user requests a link from the link service

Figure 2b: The server responds with a page of available destinations

As well as readers, authors may make use of the DLS by using the same interface agent. Since a part of the authoring process involves the author taking on the role of a reader, the author can benefit from the link service exactly as a reader can, but in addition an author can create links and edit link databases.

This kind of functionality is fairly straightforward, but the real advantage for the author comes in the kinds of link definition that are allowed. Following the Microcosm model, links may be declared to be more or less generic, i.e. the location of the selected text is constrained to appear more or less specifically within the static document context. A standard (or specific) link applies only at the exact place where the link source was selected, whereas a completely generic link will match the link source's selection at any place in any document. This facility allows the author to treat a link as a declaration stating "any place in such-and-such a document context where phrase `X' is mentioned links to this data", and so to create a set of documents together with a set of links that can be used to `come to' the documents from other places, as well as a set of links which `go to' other documents from the current documents.
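The distinction between specific and generic links can be made concrete with a small matching function. The field names and generality levels used here are invented for the example, following the description above: a specific link binds to one place in one document, while a generic link matches its phrase anywhere.

```python
# Illustrative sketch of specific vs generic link resolution (the record
# fields and level names are this example's invention, not the DLS schema).

SPECIFIC, LOCAL, GENERIC = "specific", "local", "generic"

def link_applies(link, selection, doc, offset):
    if link["text"] != selection:
        return False
    if link["kind"] == GENERIC:      # matches anywhere, in any document
        return True
    if link["kind"] == LOCAL:        # matches anywhere within one document
        return link["doc"] == doc
    # SPECIFIC: matches only at the exact place the source was selected
    return link["doc"] == doc and link["offset"] == offset

links = [
    {"text": "mitosis", "kind": GENERIC, "dest": "glossary.html#mitosis"},
    {"text": "mitosis", "kind": SPECIFIC, "doc": "paper2.html",
     "offset": 120, "dest": "figure4.html"},
]

hits = [l["dest"] for l in links if link_applies(l, "mitosis", "paper1.html", 42)]
print(hits)  # only the generic link applies outside paper2.html
```

A generic link declared once in a link database thus acts on any document the reader brings to it, which is what makes the `come-to' authoring style of the next paragraph possible.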

The `come-to' link type leads to a resource-based authoring style in which an author can publish a largely standalone suite of documents, together with some link databases which define the `routes' into, through and out of the documents. Making use of the link service allows the author to `mix together' a number of these resources as the `into' links for each of them will act on the text of the others and bind them all together. In fact, the `into' links can act to bind the resources not just to each other, but to the larger Web of documents outside the author's control--the readers' environment. One of the major benefits of this authoring style is the scope for information reuse: not only can the author vary the internal paths through the documents by changing the link databases, but also the documents themselves can be used and reused in many different situations by providing different sets of `into' and `out of' links.

4. An Open Journal Framework

This section describes a project funded in the UK by JISC's Electronic Libraries programme. The aim of this "Open Journals Project" is to address some issues of the digital information environment which is currently being formed in academic libraries as an increasing proportion of their information assets are becoming electronic, available on CDROM or via the Internet.

A problem for users of library information services in Higher Education is the isolated and diverse nature of the electronic information resources. Although a user can (in theory) access many dozens of journals, databases and articles on subjects of interest from the same terminal, it is necessary to navigate a complicated path through many providers' information gateways in order to locate any particular piece of information of (as yet) undetermined relevance.

The goal of the project is to develop a framework of information retrieval technologies and electronic publishing practices to be used by information providers (especially journal publishers) which will allow them to make their publications available not as isolated, one-off resources, but as co-operating assets within an information delivery environment such as a library at an institution of Higher Education. To achieve this goal we aim to establish novel ways of seamlessly integrating journals that are available electronically over the network with other journals and information resources that are also available on the network, thus using the capabilities of the Distributed Link Service to realize the concept of the 'open' journal.

One of the major features of the DLS which helps this goal to be achieved is the use of generic links, which enable the resource-based authoring paradigm described above. It is this facility that allows a journal to be published with a set of link databases that provide links into, through and out of its articles.

It is the `come-to' capability which is perhaps the key to allowing journals to interoperate, because it allows the creation of links between two third-party resources: not just to another publisher's documents, but from them as well.

Figure 3: The catalogue for an Open Journal of Biology

The concept of an Open Journal, then, is of a 'super journal' which consists of material from many individual journal, document and database resources, tied together by databases of links. The project is currently attempting to demonstrate this concept by producing an Open Journal of Biology, whose catalogue is shown in Figure 3. It consists of journals from a number of different publishers, served from a number of different sites in a number of data formats.

5. Conclusion

Electronic libraries will not be static and unchanging but dynamic environments capable of managing an increasing flow of documents, many of which will themselves be dynamic and revisable. For such an environment, closed-world systems like Hyper-G can provide certain electronic library services (cataloguing and hypertext links) with guarantees of data consistency inside a closed information universe for a limited set of information types, whereas open-world systems like the DLS can provide extensible electronic library services for documents of any type, maintained within any environment but without any hard and fast guarantees.

In [Levy], the description of the library as a static, closed system is challenged, since real-world collections are subject to 'crumble', i.e. decay over time. So catalogues (as well as the documents they describe) require constant maintenance, without which consistency cannot be guaranteed. Perhaps we could say that Hyper-G emphasises a consistent, controlled approach to library management, whereas Microcosm lends itself to a more 'libertarian' approach. In the real world it is likely that neither approach is sustainable in its pure form, but what mixture of philosophies is required for a digital library is as yet unclear. [Carr96] reports on research that attempts to combine Hyper-G-like consistency checking with the DLS open environment.

6. Bibliography

T. Berners-Lee, R. Cailliau, J.-F. Groff, "The World-Wide Web", Computer Networks and ISDN Systems, 24(4-5), 454-459

L. Carr, D. De Roure, W. Hall, G. Hill "The Distributed Link Service: A Tool for Publishers, Authors and Readers", The Web Revolution: Proceedings of the Fourth International World Wide Web Conference 1995

L Carr, H Davis, D De Roure, W Hall and G Hill, "Open Information Services", Proceedings of the Fifth International World Wide Web Conference, Elsevier, 1996

H. Davis, W. Hall, I. Heath, G. Hill, R. Wilkins, "Towards an Integrated Information Environment with Open Hypermedia Systems", in ECHT '92, Proceedings of the Fourth ACM Conference on Hypertext, Milan, Italy, November 30-December 4, 1992, ACM Press, 181-190.

H. Davis, J. Hey, "Automatic Extraction of Hypermedia Bundles from the Digital Library", Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, 87-96, 1995 <URL:>

U. Flohr, "Hyper-G Organises the Web", Byte Magazine, 20(11), 59-64, November 1995

W. Hall, L. Carr, H. Davis, R. Hollom, "The Microcosm Link Service and its Application to the World Wide Web", in Proceedings of the First WWW Conference, Geneva.

S. Hitchcock, L. Carr, W. Hall " An Open Journal Framework: Integrating Electronic Journals with Networked Information Resources", ELIB Project <URL:>

D. Levy & C. Marshall, "Going Digital: A Look at Assumptions Underlying Digital Libraries", Communications of the ACM 38(4), 77-84, ACM Press, April 1995

D. Levy, "Cataloging in the Digital Order", Proceedings of the Second Annual Conference on the Theory and Practise of Digital Libraries, 31-37, 1995 <URL:>

lis-elib, "Link Integrity", a thread of the Mailing List for the Electronic Libraries Programme, archived at <URL: gopher://>

K. Schmaranz, "Hyper-G and Electronic Publishing", in "Hyper-G. The Next Generation Web Solution", H. Maurer (Ed), Addison-Wesley, 1996.

7. The Authors

Leslie Carr is a Senior Research Fellow in the University of Southampton's Multimedia Research Group. His research interests include the application of HyTime, SGML and DSSSL to hypermedia applications, in particular the WWW, and he manages the Open Journal Framework project and the ongoing development of the Distributed Link Service.

Hugh Davis is a lecturer in Computer Science at the University of Southampton, UK, and was a founder member of the multimedia research group. He was one of the inventors of the Microcosm open hypermedia system, and is manager of the Microcosm research laboratory. His research interests include data integrity in open hypermedia systems and the application of multimedia information retrieval techniques to corporate information systems and to digital libraries.

Wendy Hall is a Professor of Computer Science at the University of Southampton. She is variously a Director of the Multimedia Research Group, the University's Interactive Learning Centre and the Digital Library Centre, researching into multimedia information systems and their application to industry, commerce and education.

Jessie Hey is a chartered librarian/information specialist and qualified teacher who has worked in a variety of library/information roles at California Institute of Technology, CERN and Southampton Institute of Higher Education. This was followed by 12 years at IBM's UK Development Laboratory where her jobs included managing the technical and business information services and setting up an interactive learning centre. She is now pursuing postgraduate research with the Multimedia Research Group at the University of Southampton.