
2 Distributed Information Systems


"A distributed system is one in which I cannot get something done because a machine I've never heard of is down" - Leslie Lamport

2.1 The Nature of Distribution of Resources

The general trend of down-sizing that has taken place over the past fifteen years has had a significant impact on the way electronic information is organised and manipulated. Services, processing and data have moved from the traditional central mainframe of the 1970s into intranetworks of organisation-wide workstations and personal computers. Information is arranged, accessed and processed at a level that is more local to the user.

Therefore, it can be argued that the concept of distribution has been an inevitable evolution in electronic information development; information and user bases are growing at such a rate that centralised storage and processing is no longer feasible. However, what has taken place is the creation of smaller, centralised islands which provide local services and are connected to other service providers through networks and networking protocols.

A number of technologies have been developed over the years, especially for the Internet, to address the storage and access issues associated with distributed information. These distributed information systems attempt to coordinate information (called resources) at a server or site level and to provide mechanisms to allow the user to browse and manipulate this information.

The use of the term distributed in the context of networked resources, however, implies distributed access to information and possibly the physical distribution of data across a number of machines, rather than the distribution of computation or processing.

2.2 Distributed Information Systems

Historically, the protocols developed during the mid-1970s attempted to provide access to collections of information across a network. For example, the File Transfer Protocol (FTP) (Postel et al., 1985) allows users to transfer files across the network from remote machines to their own. However, this can only be achieved once the Internet address of the FTP server which hosts the file has been determined and the position of the file within the server's file hierarchy has been located. For large sites, the latter can be a non-trivial task.
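
To make the two requirements concrete, the following sketch (using Python's standard ftplib module; the host and path are hypothetical) shows that an FTP retrieval presupposes both the server's Internet address and the file's location within its hierarchy.

    from ftplib import FTP

    def fetch_file(host, path, local_name):
        # Connect and log in anonymously, then retrieve the named file.
        with FTP(host) as ftp:
            ftp.login()
            with open(local_name, "wb") as out:
                ftp.retrbinary("RETR " + path, out.write)

    # Both the server's address and the file's path must already be known.
    fetch_file("ftp.example.org", "/pub/docs/readme.txt", "readme.txt")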

More recently, however, protocols and systems are being developed not only to provide network access to clusters of files, such as that illustrated above, but also to describe some form of relationship between those files. These systems can then accept queries which are performed upon this relationship structure to provide potentially useful information about the nature of the files contained within the system.

To assist the user in navigating and locating the information that they need across these islands of information, tools have been built around these protocols and systems. The following two sections describe the most common tools currently in use within the Internet, arguably the largest inter-networked on-line electronic information system.

2.3 Distributed Resource Locators

Distributed resource locators, as their name implies, provide the user with tools to locate and subsequently access files that are shared across machines within an inter-network. Generally, they make accesses in a specific protocol to some form of distributed resource server, which attempts to fulfil or resolve their requests either directly or by contacting other distributed resource servers.

The majority of these systems are based around the client/server model, which allows clients (users) located around the world to make requests of the services that distributed resource servers provide. This affords both the client and the server a great deal of versatility, since they only need to communicate in the same protocol and in the same data format to be understood.

2.3.1 Archie

As mentioned earlier, one of the problems associated with FTP is that before a file can be retrieved, a user must know the Internet address of the FTP server on which it resides. When the Internet was in its earliest stages and relatively small, this approach was workable. However, as the Internet has grown, the number of FTP servers has increased dramatically.

Archie (Emtage et al., 1992) is a database service that indexes files which are available on publicly accessible FTP servers. In this way, if a user knows the file that they want, but not the FTP server upon which it is hosted, they can issue a query to an Archie server, which returns a list of results that match the query.

The Archie system provides a much-needed service that is lacking within the FTP protocol, but it cannot be considered a truly distributed information system, because each Archie server must periodically re-index its file list against each individual FTP server to ensure that it is up to date.

Another system that makes use of FTP is called Alex (Cate, 1992). Alex is a system for coordinating a number of FTP servers together to provide a unified FTP-space, derived from the individual file hierarchies of each FTP server. This has the advantage of providing a much larger virtual FTP server in which requests to other FTP servers are handled transparently.

2.3.2 Wide Area Information Server

Wide Area Information Server (WAIS) (Kahle et al., 1991) is a searching tool that allows users to specify search criteria and then apply them to a set of selected resources. It is then the task of the WAIS server to pass the query on to each WAIS server supporting a particular resource. These servers perform a full-text search against the query and return a list of matches, ranked in order of relevance.
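
As an illustration only (this is not the WAIS protocol itself), the following sketch shows the kind of ranked, full-text matching a WAIS server performs: each document is scored against the query terms and the matches are returned in order of relevance. The documents are invented for the example.

    def ranked_search(query, documents):
        # Score each document by how often the query terms occur within it.
        terms = query.lower().split()
        scores = {}
        for name, text in documents.items():
            words = text.lower().split()
            score = sum(words.count(term) for term in terms)
            if score > 0:
                scores[name] = score
        # Return matches ranked by relevance, best first.
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    docs = {"doc-a": "distributed systems and networks",
            "doc-b": "hypermedia links and distributed information"}
    print(ranked_search("distributed information", docs))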

Although the majority of resources that are available through WAIS servers are textual in nature, since text is far easier to index, graphical images are being made accessible through the use of simple keyword association.

2.3.3 Gopher

The Gopher system (Anklesaria et al., 1993) is based around the concept of a distributed filing system that forms a hierarchical tree in which individual Gopher servers can incorporate information; intermediate nodes are equivalent to directories and leaf nodes are documents that may be rendered by Gopher clients.

At first glance, Gopher may be regarded as similar in nature to FTP. However, unlike FTP, the Gopher protocol allows Gopher servers to be transparently integrated to provide a seamless view of a distributed document space, the sum of which is known as Gopherspace. This form of integration allows a Gopher server to place a pointer to another Gopher server at any point within its hierarchy, thus forming a primitive logical hierarchy. Additionally, Gopher search servers can be used to provide a virtual node that is the result of a search performed over some part of Gopherspace.
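
The following sketch is an illustrative data structure rather than the Gopher wire protocol: a menu can mix local documents, sub-menus and pointers to menus on other servers, which is what gives rise to the logical hierarchy of Gopherspace. The titles, hosts and selectors are invented for the example.

    from dataclasses import dataclass, field

    @dataclass
    class GopherItem:
        # item_type is "document", "menu" or "remote" (a pointer to another server).
        title: str
        item_type: str
        host: str = ""
        selector: str = ""
        children: list = field(default_factory=list)

    root = GopherItem("Campus Information", "menu", children=[
        GopherItem("Term Dates", "document", selector="/docs/dates.txt"),
        GopherItem("Library Catalogue", "remote",
                   host="gopher.library.example.edu", selector="/catalogue"),
    ])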

Due to its interactive and browsing nature, there is a varied selection of information which is presently available across Gopher servers, from text to digital movies. A new version of the Gopher protocol, called Gopher+ (Anklesaria et al., 1993b), has been proposed which will enhance its capabilities in handling multimedia data, but the system has become somewhat eclipsed by the World Wide Web.

2.4 Distributed Hypermedia

Essentially, hypermedia is the concept of managing information in a structured and associative manner. Early hypermedia pioneers, such as Vannevar Bush (Bush, 1945), were among the first to consider using a machine (what was to become the latter day computer) as the medium through which to store, link and access this information. Interestingly enough, Bush predicted the information explosion that would occur with electronic information as far back as 1945!

The first hypermedia systems to become available (for example, Guide (Owl, 1987) and HyperCard (Apple, 1987)) were separate applications in their own right and required that all data be specially formatted and committed to the central management of the system. This approach presented two distinct disadvantages: the problem of converting large volumes of information and the problem of reusing that information outside of the hypermedia system.

These, and other limitations of early hypermedia systems, have led recent research into 'open' hypermedia, that is, a hypermedia system which is able to use data in its native form, without having to embed additional information which would affect its use outside of the hypermedia system's environment. Open hypermedia also advocates integration with other applications on the desktop and an open linking model, where multiple and different views can be placed upon a collection of data simply by altering the linking structure. Discussions of open hypermedia principles, motivations and potential advantages are given by Davis (Davis et al., 1992), Grønbæk (Grønbæk et al., 1992) and Pearl (Pearl, 1989).

As hypermedia systems increase in complexity and attempt to integrate with more existing systems, they have come under the same pressures as early centralised computing systems: to become distributed by expanding hypermedia data and functionality into and across networks. Distributed hypermedia systems can offer users greater flexibility in the data that they can access and share, alternative views on distributed information resources through the implementation of their own linking structures and increased heterogeneity across platforms and network architectures. The general advantages of distributed hypermedia are outlined by Goose and Dale (Goose et al., 1996).

The following sections give a brief survey of a cross-section of distributed open hypermedia systems that are in current use.

2.4.1 World Wide Web

Perhaps the most used and well-known distributed hypermedia system in existence today is the World Wide Web, also known as the Web, WWW and W³ (Berners-Lee et al., 1992). The Web was originally developed at the international organisation CERN in Switzerland for its high energy physics community, but its applicability in a wider context soon became apparent.

The Web is essentially a client/server model in which Web servers offer and perform services on behalf of Web clients. The basic element of information that servers deal with is a document stored in the Hypertext Mark-up Language (HTML) (Berners-Lee et al., 1995). HTML is an application of the Standard Generalized Markup Language (SGML) that is oriented toward describing links and general layout within a document. Web clients are able to access servers and place requests through a communication protocol called the Hypertext Transfer Protocol (HTTP) (Berners-Lee, 1995b). It is the task of the Web client to parse HTML documents, to render their contents and to present any links they contain to the user. The HTML format also allows non-textual information, such as bitmaps, to be included within a document.
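
A minimal sketch of this exchange is given below, using Python's standard http.client module; the host and path are hypothetical. The client issues an HTTP request for a document and receives the HTML that it must then parse and render.

    from http.client import HTTPConnection

    # Request a document from a (hypothetical) Web server over HTTP.
    conn = HTTPConnection("www.example.org")
    conn.request("GET", "/index.html")
    response = conn.getresponse()
    html = response.read().decode("utf-8", errors="replace")
    conn.close()

    # The client must now parse the HTML, render it and present its links.
    print(response.status, len(html))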

Documents are specified on Web servers through the use of a Uniform Resource Locator (URL). A URL consists of three components: the communication protocol to use, the Internet address of the Web server and a directory path representing the location of the document within the file hierarchy of the Web server. URLs are flexible enough to allow a number of protocols to be used, for example, FTP, Gopher, Network News (NNTP), etc.
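
The three components can be seen by splitting a hypothetical URL with Python's standard urlsplit function.

    from urllib.parse import urlsplit

    url = "http://www.example.org/pub/docs/thesis.html"
    parts = urlsplit(url)
    print(parts.scheme)   # 'http' - the communication protocol to use
    print(parts.netloc)   # 'www.example.org' - the Web server's address
    print(parts.path)     # '/pub/docs/thesis.html' - location in the hierarchy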

The Web can be considered both an open and a closed hypermedia system. It is open in its support for networking protocols, data formats and client extensibility, but there are two main criticisms of the Web which can characterise it as a closed system.

Firstly, the rigid client/server model it supports makes it difficult to scale, due to potentially high loads on servers, and hinders information from being presented in its most natural form, a notable example being temporal media. Extensions to the Web, such as client-pull and server-push, help to alleviate this situation, but most are non-standard. Web users also make extensive use of caches and proxies to reduce server loads and improve access times.

Secondly, the Web supports a very simplistic and weak linking model: it requires documents to be augmented with HTML mark-up, only supports one-way direct links and has a rigid linking structure. A key component of open hypermedia systems is that they can use documents in their existing format and can impose multiple views on information by changing the linking structure. Most open hypermedia systems achieve this flexibility through external link databases, which contain all of the links relevant to a particular set of documents. Research, most notably the Distributed Link Service (Carr, 1995), is being undertaken to try to apply this technology to the Web. Additionally, because the Web only supports single-directional button links, all links of a global nature, for example dictionary entries, need to be either automatically created as a document is rendered or added by hand.
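
A minimal sketch of the external linkbase idea follows; the structure and entries are hypothetical. Because links are held apart from the documents they relate, a different view can be imposed simply by consulting a different linkbase, without touching the documents themselves.

    # Links live apart from the documents; swapping the linkbase changes the view.
    linkbase = [
        {"source": "report.txt", "anchor": "open hypermedia",
         "destination": "http://www.example.org/ohs.html"},
        {"source": "report.txt", "anchor": "linkbase",
         "destination": "http://www.example.org/glossary.html"},
    ]

    def links_for(document, links):
        # Return the links that apply to a given document.
        return [link for link in links if link["source"] == document]

    print(links_for("report.txt", linkbase))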

The most influential development of the Web has been the introduction of HotJava and the Java programming language (Gosling et al., 1995). Applets are executable code that can be embedded within HTML documents, transferred across the network and executed on a variety of platforms, in an attempt to move processing from the server onto the client. They achieve heterogeneity by being written in Java, an object-oriented programming language that is compiled to platform-independent byte-code and then interpreted by a Java-aware Web client. The full potential of applet code has yet to be realised, since current examples are quite primitive. However, it is a significant step forward for Web technology and will present some interesting developments for the future.

2.4.2 Hyper-G

Hyper-G (Kappe et al., 1993) is a distributed hypermedia project being developed at the Technical University of Graz in Austria, whose original aim was to provide a general-purpose university information system to support a wide range of activities.

In a similar manner to the Web, Hyper-G functionality is provided by a set of Hyper-G servers. However, unlike the Web, Hyper-G provides a unified view of distributed resources by allowing aggregations of documents, called collections, to span multiple Hyper-G servers. Moreover, Hyper-G permits a more flexible linking model because links are stored in external link databases. Although Hyper-G supports bi-directional links, they are still of a button-oriented nature, which makes global and repetitive linking difficult.

Hyper-G has been extended from its original aims to provide a large-scale, distributed, hypermedia information system that combines the intuitiveness of a top-down hierarchical navigation model with the immediacy of associative hyperlinks. Additionally, Hyper-G advocates the use of an object-oriented database to provide information structuring and link maintenance facilities.

In a distributed context, Hyper-G attempts to maintain scalability by using a document naming system that allows local replication and caching, and the integrity of link databases across server boundaries is achieved through the use of a scalable flooding algorithm. The Hyper-G model also supports a comprehensive user model which can assign permissions at the document, document group, individual user and user group level. Unfortunately, there is no provision in Hyper-G for making a distinction between a user and an author. The multi-protocol facilities of the Hyper-G system allow it to communicate with any open protocol system. Current integration is provided for the native Hyper-G client, Gopher and the Web (Andrews et al., 1995), with future plans for WAIS and FTP.

The ability to integrate with other distributed information systems makes Hyper-G a powerful hypermedia tool. Indeed, it has been suggested that all existing Web servers could be replaced with Hyper-G servers without major problems (Flohr, 1995). Hyper-G's main weakness appears to be its lack of extensibility due to a non-modular architecture, which can make it difficult to customise to user requirements.

2.4.3 Microcosm: The Next Generation

Microcosm: The Next Generation (MCMTNG) (Goose et al., 1995) is a distributed open hypermedia system that is based around the philosophy and framework of the Microcosm open hypermedia system. Both systems have been developed at the Multimedia Research Laboratory within the University of Southampton.

Unlike both the Web and Hyper-G, the functionality of the MCMTNG system is represented by a set of asynchronous, communicating processes. This modular, process-driven architecture has the advantage that processes can be added to the system to increase or augment its functionality, or to customise it to a particular user's requirements. What is different about the MCMTNG system is that it exists at the user-level, rather than at the site or domain level; there can be multiple MCMTNG systems executing within a domain for a number of given users simultaneously.

Local hypermedia functionality is handled by the processes of a user's MCMTNG system. These processes, primarily link databases, document managers, viewers and link managers, are all registered with and attached to a local message router; they register their interests with the router, send messages to the router and receive relevant messages from the router. The hypermedia model is inherited from the Microcosm open hypermedia system (Davis et al., 1992; Hill et al., 1993) and provides for external link databases and multiple link types. The most important of these is the generic link; a generic link is authored only once and automatically applies to all selections which match the original link creation data.
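
A minimal sketch of generic link resolution, in the spirit of the Microcosm model, is given below; the link records are invented for the example. The point is that the link is authored once and resolves for a matching selection made in any document.

    # A generic link is authored once and applies to any matching selection.
    generic_links = {
        "agent": "docs/agents-overview.txt",
        "linkbase": "docs/linkbase-notes.txt",
    }

    def follow_link(selection):
        # Resolve a user's selection against the generic linkbase.
        return generic_links.get(selection.lower())

    print(follow_link("Agent"))    # resolves wherever the selection is made
    print(follow_link("gopher"))   # None - no generic link has been authored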

When MCMTNG systems inside or outside of a domain wish to communicate, they initially make contact through a domain server. The domain server provides information about the MCMTNG systems that are executing within the domain and the information resources that are available. Once the communicating parties have been established, communication subsequently occurs through the two message routers of the respective MCMTNG systems.

The MCMTNG system extends the original concept of a hypermedia application from Microcosm and re-fashions it in a distributed context. A hypermedia application is now a reflexive entity that can contain documents and/or pointers to other hypermedia applications, similar to a Hyper-G collection. This flexibility allows users to create large hypermedia structures from individual hypermedia applications, according to their interests. Indeed, users can place their own views on hypermedia applications by rearranging, removing or adding other hypermedia applications.

The MCMTNG system is currently in a prototype stage. However, the flexible and extensible nature of the architecture and the use of open protocols mean that it is possible to integrate with other systems and protocols. The future work of the MCMTNG system lies in exploring the possibilities of the reflexive hypermedia application framework within the context of a widely distributed environment.

2.5 Distributed Information Management

Distributed information management is a term that has traditionally been used to describe the integration and management of distributed database systems. More recently, however, it has been used to describe the integration and management of distributed information systems and resources across networks and protocols (De Roure, 1996).

The following sections describe four issues which are key to achieving successful distributed information management.

2.5.1 Resource Discovery

The purpose of resource discovery is to search through distributed information systems and to present new sources of relevant information to the user. Since resource discovery mechanisms are employed precisely because there are too many distributed information systems for the user to search manually, the search algorithm must be accurate, ensuring that relevant data is not overlooked and that irrelevant data is discarded before it reaches the user.

Another function of resource discovery is resource monitoring: the active task of notifying the user when the contents of a resource change. This is particularly useful if the user is monitoring temporal media, for example stock prices, but can also be used to indicate when a user should revisit information resources.
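
One simple monitoring strategy is to poll the resource and compare a digest of its contents between visits; the following sketch (with a hypothetical URL and polling interval) illustrates the idea, although a practical monitor would need to be far more considerate of server load.

    import hashlib
    import time
    from urllib.request import urlopen

    def monitor(url, interval_seconds=3600):
        # Poll the resource and report when a change in its contents is seen.
        last_digest = None
        while True:
            content = urlopen(url).read()
            digest = hashlib.sha256(content).hexdigest()
            if last_digest is not None and digest != last_digest:
                print("Resource has changed and may be worth revisiting:", url)
            last_digest = digest
            time.sleep(interval_seconds)

    # monitor("http://www.example.org/prices.html")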

Fortunately, nearly all of the distributed information systems described previously contain some form of searching algorithm that allows their textual contents to be interrogated. Unfortunately, some provide richer search mechanisms than others. For example, FTP provides a very weak searching mechanism, since it indexes only on file names, whereas Web search engines index on document contents.

2.5.2 Information Integrity

As information becomes distributed across wide areas, information integrity becomes a real need and a real problem. Due to the packet loss and latency associated with networks, it is difficult to ensure that consistency updates are made in a timely fashion. Additionally, when considering collaborative working environments, versioning and update control need to be implemented to ensure that edits are not lost.

In terms of hypermedia systems, the problems of link and document consistency must also be handled. Link consistency deals with ensuring that the integrity of links is maintained, even if the source and/or destination anchor moves. In most cases, link inconsistency is relatively easy to deal with, since it either involves removing the link (if it is no longer valid) or re-pointing the start or end anchor to the new location.

Document consistency is a much more difficult problem, since it deals with the contents of a document changing and can also imply link inconsistency. If the end anchor of a link points to a keyword in the centre of a given document and that keyword is subsequently deleted in an edit, how is this resolved?

Unfortunately, in most instances of inconsistency, some form of user intervention will be required to ensure that the damage is repaired correctly. In all but the most trivial of cases, consistency algorithms can do little more than highlight the problem for the attention of the user. However, in distributed hypermedia systems that employ separate linkbases, such as Hyper-G and MCMTNG, the task of consistency checking is made simpler since interaction only occurs between the linkbases. Where link information is embedded, as with the Web, interaction must occur between all documents within the system.
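
The following sketch illustrates why the linkbase approach simplifies the task: each link's anchor can be validated against the current contents of its source document, and any broken links are flagged for the user's attention rather than repaired automatically. The documents and links are hypothetical.

    def check_linkbase(links, documents):
        # Flag links whose anchor text no longer appears in the source document.
        broken = []
        for link in links:
            text = documents.get(link["source"], "")
            if link["anchor"] not in text:
                broken.append(link)
        return broken

    documents = {"report.txt": "Open hypermedia systems use external linkbases."}
    links = [{"source": "report.txt", "anchor": "linkbases", "destination": "glossary.txt"},
             {"source": "report.txt", "anchor": "keyword", "destination": "notes.txt"}]
    print(check_linkbase(links, documents))   # only the 'keyword' link is flagged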

2.5.3 Navigation Assistance

Navigation assistance is a process of assisting the user in navigating some form of information resource. This information resource could be the information contained within a distributed information system or could be the information generated by a number of resource discovery algorithms. Either way, a navigational assistance algorithm can be employed to protect the user from information overload.

Oren (Oren, 1987) sums up the role of this algorithm as:

"A parallel would be the human reference librarian who does not comprehend the material in articles being sought, but does understand the conventions of card catalogues, abstract collections, citation indexes and bibliographical references. Because these relations can be made explicitly in hypertext they can be utilised without, for instance, having any deep comprehension of the meaning of any article title."

Wilkins (Wilkins, 1994) further describes navigation assistance as an algorithm that can fulfil the following requirements:

2.5.4 System Integration

A key aspect of distributed information management is the ability to manage information resources that reside on heterogeneous networks and heterogeneous platforms and are represented by heterogeneous protocol formats. To this end, distributed information management tools need to be able to integrate with a wide range of distributed information systems.

However, integration is more than protocol conversion, since there is a semantic problem to be overcome. For example, how are links translated between the Web and MCMTNG? If there is more information represented in a MCMTNG link, how is this stored within a Web link? Furthermore, is it possible to apply links across distributed information systems? If so, where are these links stored and who resolves them?

This illustrates that there are two fundamental approaches to system integration: arming distributed information management tools with the necessary information to be able to converse with multiple distributed information systems and equipping them with the necessary information to make semantic protocol conversions.
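
The semantic side of the problem can be made concrete with a small sketch: converting a hypothetical, richer link record into an HTML anchor necessarily discards whatever the Web's linking model cannot express.

    def to_html_anchor(link):
        # Keep only what an HTML link can express: a one-way, embedded reference.
        return '<a href="{0}">{1}</a>'.format(link["destination"], link["anchor"])

    rich_link = {"anchor": "agent", "destination": "agents.html",
                 "type": "generic", "direction": "bidirectional"}
    print(to_html_anchor(rich_link))
    # The 'type' and 'direction' fields are lost: the conversion is lossy unless
    # the extra semantics are stored and resolved elsewhere.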

2.6 Summary

This chapter has highlighted the essential problem that faces all inter-network users today, and possibly for the future. As systems such as the Internet continue to grow unbounded, the amount of information available within them will also continue to grow. This leads to the concept of information overload.

Currently, each distributed information system possesses its own set of tools to assist the user in navigating its information resource. However, these tools are only of value within that particular distributed information system and generally only perform the function of resource discovery. Therefore, there is a real need to develop tools that cross distributed information system boundaries to provide the user with a view on the entire collection of information resources that are available.

Additionally, distributed information management shows that the user requires more tools than simplistic resource discovery mechanisms. They require additional discovery aids, navigation aids and maintenance aids, each of which should integrate with a wide range of distributed information systems and should also perform protocol conversion (both syntactic and semantic) in a sensible fashion.

This thesis advocates that a potential solution to this problem is to employ agent technology to perform these tasks on behalf of the user. These agents would be given goals by the user and work autonomously and intelligently to achieve those goals. The next chapter examines the state of the art in agent technology and attempts to classify the differing views on what an agent actually represents.



