University of Southampton

Structure and Hypertext

by

Leslie Alan Carr

A thesis submitted for the degree of
Doctor of Philosophy

in the
Faculty of Engineering and Applied Science
Department of Electronics and Computer Science

November, 1994

UNIVERSITY OF SOUTHAMPTON

ABSTRACT

FACULTY OF ENGINEERING AND APPLIED SCIENCE

DEPARTMENT OF ELECTRONICS AND COMPUTER SCIENCE

Doctor of Philosophy

Structure and Hypertext

by L. A. Carr

Hypertext techniques are now beginning to be used in the ways that early researchers anticipated, from personal note taking to online help for ubiquitous computing environments and universal literature resources, yet despite this, hypertext models have remained substantially unchanged. This thesis investigates the nature of text, how it may best be modelled on a computer and how the connections between related texts may be expressed in a flexible and efficient way.

First we look at the development of hypertext systems and then compare that with the complex structured nature of texts themselves. LACE, a small-scale hypertext system based on structured texts is introduced and compared with other hypertext systems. Approaches to large-scale distributed hypertexts are discussed, and LACE-92, a system used to produce hypertexts from distributed information services is presented. Finally LACE-93, a new document architecture for global hypertext environments is proposed.

TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES

ACKNOWLEDGEMENTS

I would like first to thank Prof. David Barron, my supervisor throughout the period of this PhD, who convinced me as an undergraduate that documents were of crucial importance in the study of Computer Science. Much gratitude is due to Sebastian Rahtz who named LACE, touted LACE at conferences and transcribed a conversation between his two cats which defined the term `third generation hypertext' in terms of LACE. Thanks also to Gerard Hutchings for sharing an untidy office and on many occasions straightening out an untidy mind; to Wendy Hall for provoking me into thinking and writing and to Hugh Davis for helping to bounce ideas around and generally keeping me from panicking.

But most of all, many thanks and much appreciation to my wife Jan who has at last received a straight answer to the question "How long until you finish the last chapter?".

1. OVERVIEW OF HYPERTEXT MODELS

The concept of hypertext is commonly attributed to Vannevar Bush [31] in the 1940s and Doug Engelbart [53] and Ted Nelson [108] in the 1960s; however, the systems which they pioneered have, at first sight, little in common. This chapter examines the various models which have been proposed and implemented for hypertexts, and discusses the effects of these models on the end users (both producers and consumers) of the hypertext.

1.1 Introduction

The classic definition of hypertext made by Nelson in [108] is ``forms of writing which branch or perform on request'' or more simply ``non-sequential writing''. Various other terms were claimed in the same work, notably hypergrams for diagrams which performed similarly and stretchtext for literature which expanded and contracted to greater and lesser degrees of detail upon request. Neither of these names has become well-known, although the concept of hypergrams has been subsumed by hypermedia--performing non-sequential drawings, images, sound and video--and the concept of stretchtext has been most effectively demonstrated (under the more prosaic alias of `inline replacement') by the Guide system (see section A1.9).

Conklin's famous introduction to the subject [42] classifies the various hypertext systems into four broad application areas. Two of these were macro literary systems for large-scale literatures and problem exploration tools for highly flexible personal use on a much smaller amount of information (the third was a derivative of the former category and the fourth a miscellaneous category). However, it is important to realise that the two applications are not somehow separable: a scholar who needs to study and `inwardly digest' a topic needs not only to express his or her own thoughts, but also to draw on the information capacity of a huge library with the musings and conclusions of other researchers. These two application areas are then simply complementary functions that the same hypertext system should be able to provide. Van Dam [147] emphasises the ability of the computer to enhance connectivity, and that must be provided both in the large (in the sense of a newly published book being added to a library by connecting its key words and concepts to the huge web of pre-existing information links) and in the small (by connecting and re-connecting the hypertext structures which represent an author's initial ideas according to a changing and evolving personal understanding of the subject).

A review of the hypertext literature of the last twenty years shows that hypertexts have been implemented variously as:

* indivisible information nodes connected by links: Memex [31], NoteCards [63,64], KMS (a commercial version of the ZOG research project [98]), HyperCard [5]

* structured nodes connected by links: Augment [53], Dynatext [46]

* documents and modular sets of connecting links: Intermedia [148], Microcosm [55]

* a knowledge-based framework of texts: Aquanet [92], gIBIS [10], Much [117]

* non-linear `unfolding' text : Guide [27]

* everything incestuously connected to everything else: Xanadu [109]

* a word-processed document with added links: Word for Windows [101]

The above systems mainly share a common general model of hypertext: `nodes' of information which are `linked' to each other and so can be characterised by the linking mechanism, by the nodes that are linked and by the user interface which determines the way that the user encounters and manipulates these node and link objects.
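This shared node-and-link model can be made concrete with a small sketch. The names below (Node, Link, Hypertext, follow) are illustrative inventions for this thesis, not drawn from any of the systems surveyed:

```python
# A minimal sketch of the generic node-and-link hypertext model
# described above; all names are illustrative.

class Node:
    def __init__(self, name, content):
        self.name = name          # unique identifier within the network
        self.content = content    # text (or other media) held by the node

class Link:
    def __init__(self, source, destination):
        self.source = source            # name of the node linked from
        self.destination = destination  # name of the node linked to

class Hypertext:
    def __init__(self):
        self.nodes = {}
        self.links = []

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_link(self, source, destination):
        self.links.append(Link(source, destination))

    def follow(self, source):
        """Return the nodes reachable from 'source' in one jump."""
        return [self.nodes[l.destination]
                for l in self.links if l.source == source]

ht = Hypertext()
ht.add_node(Node("memex", "Bush's 1945 design ..."))
ht.add_node(Node("xanadu", "Nelson's docuverse ..."))
ht.add_link("memex", "xanadu")
print([n.name for n in ht.follow("memex")])  # ['xanadu']
```

The characterisations that follow--linking mechanism, node contents and user interface--are all variations on this basic structure.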

1.2 Nodes

Nodes are the most obvious features of hypertext systems as they contain the system's complete information store and display it to the user. Characteristics of a node which are important to the users (readers or authors) of a hypertext system are the kind of information that can be stored (text, diagrams, exotic data types or media) and the capacity of each node.

1.2.1 Node Contents

Some hypertext systems can only hold text (for example, NLS and ZOG), some also allow diagrams and bitmapped pictures to be displayed (HyperTIES, HyperCard and Guide) while others give extensible mechanisms for interpreting many different kinds of information (e.g. timelines or video) inside the hypertext network (Intermedia, NoteCards, Word and Acrobat).

It is important to realise that simply displaying a new information type is not sufficient for true multi-media hypertext (or hypermedia). HyperCard may easily enough be persuaded to display sequences from a videodisk, but unless the methods for creating links between video sequences or between video stills and text are in place then the system is simply acting as an expensive remote control device. It is imperative that each new information type be completely integrated into the system's hypertext network. Both Intermedia and NoteCards are successful in this area because they themselves are built on extensible environments (Object-Oriented C and Lisp respectively) so that modules defining hypertext functionality for new information sources may be slotted into the existing application's framework.

A compromise often adopted is to graft a `dumb' medium (like videodisk) onto a true hypertext network by associating each frame/sequence on the video with one `place holding' node in the network. Hypertext operations on the new medium are then mapped onto existing operations on a normal node. For example, node A will always activate the showing of video frame A', and node B will cause video sequence B' to be played. A link from frame A' to sequence B' can then be effected by placing a button on node A which links it to node B. The nodes would commonly display a brief textual description of the video sequences that they shadow. HyperCard is an example of such a hybrid hypermedia system: widely available HyperTalk extension commands interact with external devices (such as a video player) via the host computer's communications port, although no information can be received from the remote device.
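The place-holding scheme can be sketched as follows. Here `play_sequence` stands in for whatever device-control command the host system sends down the communications port--an assumption for illustration, not an actual HyperTalk command:

```python
# Sketch of 'place-holding' nodes shadowing a dumb medium (a videodisk).
# Each node records the frame or sequence it shadows; hypertext links
# act on the nodes, and playback is a side-effect of visiting them.

def play_sequence(start, end):
    # Stand-in for a device-control command sent to the video player.
    return f"playing frames {start}-{end}"

placeholders = {
    "A": {"description": "Still of the launch site", "frames": (100, 100)},
    "B": {"description": "Lift-off sequence",        "frames": (101, 250)},
}
links = {"A": "B"}   # a link from frame A' to sequence B' is a node link A -> B

def visit(node_name):
    node = placeholders[node_name]
    return play_sequence(*node["frames"])

# Following the link on node A plays sequence B':
print(visit(links["A"]))   # playing frames 101-250
```

The hypertext machinery never touches the medium itself; it only ever manipulates the shadow nodes, which is exactly why the scheme degrades to a remote control when the linking is absent.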

Even text-only systems may be classified according to the kinds of text which can be displayed. Some allow only fixed-font text (early systems such as NLS, HyperTIES and original versions of HyperCard), some allow multiple text fonts, sizes and styles (later versions of HyperCard, Guide), and some provide full composition facilities, including horizontal and vertical spacing (Word).

1.2.2 Node Capacity

A second significant issue relating to nodes is their size. Systems such as ZOG and HyperCard have rigidly fixed-sized nodes, whereas others are more flexible. The size of a node is not just a function of the size of its host system's display since both ZOG and HyperTIES use a standard `terminal-sized' display, though the latter allows a node to be constructed from several screensful of information. Neither is it solely to do with the sophistication of the display hardware: HyperCard (fixed, screen-sized nodes) and Guide (unlimited scrolling nodes) both run on (small) bitmapped displays which make heavy use of windows and pointer devices, whereas ZOG (fixed screen-sized nodes) and NLS (flexible node length) both run on dumb terminals.

The size of the display is an important factor in any information system since it is desirable to present an uncluttered, easy-to-read screen to the user. Narrow newspaper columns are difficult to read because the eye has to keep rescanning to find the start of the line; small screens force the reader to keep flipping between pages of information.

If display-size is a problem for the reader, node-size is a problem for the author. Each node in a hypertext network typically requires a unique name and occupies a unique space in the information `hierarchy'. If the size of the node is fixed, then the author is faced with two choices, either to edit the material to fit within the node or to split the node into two new ones. The former choice may be effective during the initial construction of the network, but extra material that is to be added during a process of revision may force the latter to occur. If so, the split can be undertaken in two ways, either by re-partitioning the information into several logical chunks, each with its own node, or by creating a linear sequence of continuation nodes. Both of these courses of action may have the knock-on effect of invalidating some of the links to the existing node, requiring further revision work.

1.3 Links

The way in which links are implemented is key to any hypertext system as they provide the `non-sequential branching' which is at the heart of hypertext functionality. If a hypertext system is really about enhancing connectivity, as has been claimed above, then how are the connections made? Can they be adequately visualised? And how accurately can they target the information that they link?

1.3.1 Link Representation

Systems either give links the status of first-class objects and allow the user (reader or author) to manipulate them directly, or support `insubstantial' links which are hidden in the system (possibly even part of the text) and only have an observable existence when the user in some way `actions' them.

Intermedia is a prime example of a system with first-class links. There the links are stored separately from the documents to which they refer. These links also have the capacity for holding information in the form of attribute-value pairs that can be used as part of a reader's query.

It is the first-class nature of links that allows systems to provide a graphical browser. Without explicit storage of the links between the nodes it is impossible to tell the overall structure of the network without actually traversing it. Hence only systems such as NoteCards, Intermedia and derivatives of HAM provide such a function.

Insubstantial links are merely specifications for the address of a jump and as such only have any existence when they are invoked. HyperCard links, for example, are generally buttons containing the instruction ``go to card id 42106'' (or similar). However, buttons need not contain jump instructions, and, in fact, such a jump is not constrained to be attached to a button: it can instead be a part of a more general handler that is invoked when a particular key is pressed or invoked on some arbitrary event. It may also be the result of a more general computation, such as ``look up the id of the card which matches the text that the user has selected and go to it''. Because there is no clear correspondence between any particular object and the set of links in a hypertext network, HyperCard has no means of manipulating the network as a whole. Although it does maintain a graphically displayed list of the last 42 cards visited, no information is kept of alternative routes that may be followed.

NLS links are simply stored as part of the text of each node. Selecting the link reference causes a hypertext jump to be activated, but once again it cannot be easily distinguished as a link except by the user who understands the context. HyperTIES and ZOG also make use of links embedded in but displayed differently from the nodes' content. HyperTIES highlights the embedded link by displaying its name in a contrasting font, ZOG uses spatial highlighting, separating the links from the text. In both cases the `address' of the destination node is the same as the name of the link.

Guide is more perverse in that it has a clear idea of the existence of links and indeed cannot display the document without a knowledge of the state of each link, but it has a less well-defined concept of nodes. As has been explained, each Guide document is considered a continuous scroll of material with links `folding' and unfolding spans of material. However, the reference links to other documents are of the insubstantial kind.

Insubstantial links are by nature unidirectional (HyperCard has no ``come from'' command), which makes it difficult to model a naturally symmetric relationship between nodes and instead gives rise to a directed walk-through of a document. It also leads to the ``you can't get there from here'' phenomenon, where many links may lead into a node, but none out, so leaving the reader stranded.

1.3.2 Link Granularity

The granularity of a link is an important consideration. In many systems the destination of a link is a node and no targeting of the information contained therein is possible. As has been mentioned before, this can cause problems when editing a node requires splitting its contents amongst several new nodes. Another shortcoming is that traversing a link does not necessarily take the reader's attention straight to the relevant information, but requires them to search for it on the new node. Under such circumstances the reader must be completely clear about why the link is there and where it is supposed to be taking them. Intermedia solves these problems by allowing any span of text within the node to be the destination of the link. A link anchor is placed next to the text to draw attention to it.

The granularity of the link source is also important. By limiting the positioning of links to an empty space at the bottom of the frame, the Memex forced the whole frame to act as the link's source. This of course makes it very difficult to know what is being linked from, i.e. what is the key phrase on this frame. Since there can be many links present on each frame the job is harder still, especially when complete frames also serve as link destinations. Intermedia is much more flexible since it allows any span of text within a node to act as the link source. NLS, ZOG, HyperTIES and Guide all have specific words and phrases within the text that are bound to the link, whereas NoteCards, HyperCard and Acrobat anchor a link at a single position on a card. As has been mentioned before, HyperCard's buttons are fixed at a geographical position on the card rather than a logical position within the text, making it very cumbersome to edit a node.

1.4 User Interface

Although nodes and links provide the basic facilities of a hypertext system, the user interface is of paramount importance since it determines the ease of use of each of the facilities. A library, after all, has many sophisticated indexing and cross-referencing facilities but the manual procedure required to follow them through is time-consuming enough to deter casual browsing.

All systems make it easy to follow the natural progression of a document whether it involves scrolling through a linear document (Intermedia, Guide) or tree-walking through a hierarchical structure of nodes using next-sibling and return-to-parent operations (ZOG, HyperCard). Of more interest is how cross-reference links are encountered--here there are several considerations for a reader to be aware of--how can a link be recognised (what is its visual representation), and once the link opportunity has been recognised how can the link be invoked?

1.4.1 Link Display

The Memex relied on marginalia facilities in which the information about a link was stored in a blank space on the node. This allowed no `inline cueing' but simply alerted the reader to a link from something on this page to further information. All other systems provide facilities for marking the parts of a node's information which are the subjects of links: some (ZOG, HyperTIES and Guide) highlight the appropriate key words or key phrases, others (HyperCard, Intermedia, NoteCards) provide link anchors or buttons which act like a footnote marker, alerting the reader to the presence of auxiliary information related to the information which appears here. NLS stands alone as it provides no visual cueing of links, but a semantic cueing akin to writing ``see statement 4b''.

1.4.2 Link Triggering

With the Memex and ZOG this was achieved by a mapping between each available link option and a key on the keyboard (exactly like choosing from a menu). For Intermedia, NoteCards, HyperCard and Guide this is done instead by pointing at the link and clicking on it with the mouse button, in line with the direct manipulation model. HyperTIES has a slightly different (keyboard-based) model: one link-point remains highlighted but this hot-spot is moved from link-point to link-point under control of the cursor keys. When the correct link has been selected the user activates the link with a different key. NLS also has an explicitly two-part operation (select and activate) by making the reader select the link text with a mouse before the "follow-this-link" action is taken. This type of approach is highly extensible as it allows the user to control both the selection and the action to be taken, allowing dynamic data lookup rather than just the normal ``jump to another card which also makes reference to this'' consequence. This issue is elaborated in [55] with a description of a system which works on the basis of selections and actions.

1.4.3 Hypertext Jumps

Once a link has been recognised and selected, the system in all cases performs an immediate hypertext jump to the linked information. It is interesting to question why this should always be so: if someone comes across a citation in a library journal they do not stop what they are doing to follow up the reference. Rather they take a note of the reference (perhaps looking to the bibliography at the end of the article to evaluate the true relevance of this item) and follow it up after they have finished reading the journal. The instantaneous jump phenomenon is more suited to a computer than a human, since it leads to a depth-first traversal of the network with a growing stack of outstanding articles to be finished (exactly mimicking the operation of a computer following a program through nested levels of subroutines). Hypertext systems which implement delayed jumps may well act to cut down the disorientating effects of navigation.
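The contrast can be made concrete: an immediate jump pushes the interrupted article onto a stack of unfinished reading, whereas a delayed jump merely notes the reference and returns to it once the current article is done. The network and function names below are illustrative:

```python
# Illustrative contrast between immediate (depth-first) and delayed
# (note-the-reference-for-later) link following over the same network.
from collections import deque

network = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}

def immediate(start):
    """Follow each link the moment it is met: a stack of unfinished articles."""
    order, stack = [], [start]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(network[node]))  # interrupt reading to jump
    return order

def delayed(start):
    """Note each reference and follow it only after finishing the article."""
    order, queue = [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(network[node])            # finish reading, then follow up
    return order

print(immediate("A"))  # ['A', 'B', 'D', 'C']
print(delayed("A"))    # ['A', 'B', 'C', 'D']
```

The immediate strategy abandons article A twice before finishing with its own references; the delayed strategy completes each article's context before descending, which is closer to the library reader's habit described above.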

A common side-effect of making a hypertext jump is losing the context of the original information since many systems can only display one node at a time on the screen. This adds to the disorientation of the user who is navigating through an unknown network of unfamiliar information. NLS contrives (under certain circumstances) to lock part of a node onto the display, keeping some familiar reference for the user despite being limited to a terminal-sized display. NoteCards and Intermedia allow many nodes to be displayed at a time and so avoid the dangers of losing the reader but at the risk of swamping him or her with too many pieces of concurrent information. Although the display can handle many windows some are bound to become partially concealed. This is made worse by the hoarding instincts of the reader who is frightened to put away any window `just in case' it is needed again. Guide has a novel approach to these problems as it performs an inline replacement of the link cue by the information stored at the link's destination. Hence clicking on a highlighted key-phrase in Guide may cause it to be replaced by a paragraph giving a fuller explanation of the phrase. In this way both the source and destination contexts are still visible to the reader and hyperspace disorientation is minimised.

1.5 Summary

In this chapter we have seen hypertext as blocks or streams of text (or other forms of information) linked to each other (according to some set of rules). While adding functionality to a static print format, some of the basic features of these systems (fixed nodes, coarse links) make the hypertext cumbersome to write or difficult to comprehend.

The `hyper-' prefix in the word `hypertext' indicates `more than', thus hypertext is literally `more than' text. The advantages of nonlinearity, cross-reference jumps and multimedia information have to be balanced against some of the disadvantages mentioned above or the resulting system will implement hypotext, text which is less useful or less accessible than its print-bound counterpart.

In the following chapters we will examine the nature of text and how it can be adequately modelled by a computer. Then we will look again at hypertext, in particular a system developed by the author in 1988 which is based on these ideas of text. We then examine the use of computer environments which aid reading or writing both electronic documents and electronic non-linear hypertexts and then return to re-examine the models of hypertext beyond the `nodes and links' seen here.

2. TEXTS

It is important to consider for any hypertext system, such as the ones described in the previous chapter, not only the technology and user-interface of linking texts, but also what it means to make a link between two texts, and the meaning of a linked network of texts. The impact of these considerations on the usability of the resultant hypertexts can be understood when it is considered that a single text is itself a complex network of related components.

In this chapter we look at the way in which texts themselves are constructed, how texts can be adequately represented on a computer and compare the way in which hypertext systems model a network of linked texts.

2.1 Text as Structures

In a study of hypertexts and their features, it is instructive to consider the nature of `ordinary' or `traditional' texts. We can ask the questions: what is a text? How is a text composed by writers, and how is it understood by readers?

In a computing environment, `text' is usually thought of as a simple sequence of characters as defined by the ASCII encoding. In fact, even to the present day, the most prevalent type of document or file is the text file: a sequence of ASCII characters whose record ends define the line breaks for display purposes. However a text in its fullest sense consists of more than just its encoding and layout information. It principally contains high-level cognitive information which is communicated in a natural language expressed by the character coding and layout. It is the expression of this embedded information which is the crucial purpose of the text, and it is the responsibility of the writer to construct the text in such a way as to present the information to the reader accurately and in a fashion which can easily be assimilated. In this way the intent or purpose of a text is not to produce a suitably formatted piece of paper, but to inform a reader.

Human cognition is often described in terms of a semantic network into which new facts are added through the learning process (see for example [54]). This gives rise to a close correspondence between texts and computer programs. A program specifies a set of actions which, when elaborated by a computing processor, produce a certain change of state in that system. Similarly, a text, when elaborated by a suitable cognitive processor, produces a change of state in that system, with an increased understanding achieved by an incrementally updated knowledge network.

2.1.1 A Text is more than a Linear Sequence

Nelson defined hypertext as any form of writing which cannot be simply expressed as a linear sequence, but it is not only the new technologies of hypertext which have extended writing beyond the linear sequence. Various techniques have evolved within conventional text that allow richer forms to be expressed than a simple one-dimensional elaboration of facts. Hierarchical structures are a case in point: the sub- and super-ordination of concepts achieved by the various levels of the tree structure is mapped onto a conventional linear text sequence by a pre-order tree expansion. Hierarchies are seen particularly in technical and legal writing through an extensive range of different `levels': parts, chapters, sections, subsections, subsubsections, paragraphs and subparagraphs. Other literary forms have shallow hierarchies: plays have Acts and Scenes, novels have parts and chapters, newspapers have sections and articles.
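The pre-order tree expansion mentioned above can be sketched in a few lines; the (heading, children) tuple representation of the hierarchy is an illustrative choice:

```python
# Pre-order expansion: mapping a hierarchical document structure onto
# the linear sequence in which it appears on the page. Each node is a
# (heading, [children]) pair -- an illustrative representation.

def preorder(node):
    heading, children = node
    sequence = [heading]            # a section's heading precedes its body
    for child in children:
        sequence.extend(preorder(child))
    return sequence

book = ("Part I", [
    ("Chapter 1", [("Section 1.1", []), ("Section 1.2", [])]),
    ("Chapter 2", []),
])
print(preorder(book))
# ['Part I', 'Chapter 1', 'Section 1.1', 'Section 1.2', 'Chapter 2']
```

Note that the linearisation discards the nesting itself: the reader must reconstruct the sub- and super-ordination of concepts from typographic cues such as heading levels and numbering.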

Hierarchies bring structure to a text, but even sequence itself is a simple structuring tool, allowing a directed development of an argument or the build-up of context. A narrative text often makes use of parallel threads (sequences) which are interleaved throughout the text. Cross-reference is frequently seen in technical writing, allowing the author a means of emulating a network of ideas rather than a fixed hierarchy.

2.1.2 A Text as a Hierarchical Structure of Ideas

We are familiar with the use of structure to decompose complex ideas within a text, since technical documentation, such as manuals and reports, is made up of chapters, sections and subsections in which ideas are developed in successively greater detail as the reader descends through the hierarchy of sections. However, this familiar (and often explicit) hierarchy is not the only type of structure which is present in a text. The study of discourse theory [142] identifies large-scale super-structures which organise complex semantic information within the text. The text is not just a sum of the component parts; the function of superstructures is to add meaning to the text by defining its global coherence. Superstructures contain two other kinds of structure: micro-structures which relate ideas at the level of words and phrases in a fashion analogous to semantic networks and macro-structures which summarise and encapsulate the meaning of statements and paragraphs. Whereas macro-structures are distilled content, superstructures are independent of content.

According to [141], the functions of superstructures become conventionalised in a given culture, leading to fixed schemas for the global content of a text. The following five superstructures are identified as being common to many text types:

Introduction: presuppositions and background

Problem: a twist on the state of affairs

Solution: resolution of the above

Evaluation: discussion of consequences

Conclusion: closing/summary

Stories, scientific papers, dramas and arguments are all identified as containing the above superstructures. Despite the variety of texts, the author is leading the reader towards a particular conclusion via a particular interpretation of the facts--a directed presentation. In each type of text, the structure acts both to direct and constrain the content--no introductory material is allowed to appear within a concluding section, nor is the conclusion allowed to precede the material which supports it.

2.1.3 A Text as a Hierarchical Structure of Presentation

Superstructures are not related to content, but operate at a meta-level at which the author makes decisions about the presentation of the semantic content. This `presentation structure' is the form of written or spoken communication used to present ideas in the form of narratives, technical documents, arguments or dramas. The model for conventional `texts' is that of a lecture in which an author actively educates an audience. The structure of the text (or presentation) is key to the audience's understanding of the content because it assigns a role to each piece of information, showing some facts to be subordinate to, or of equal importance to, others. The relationships made explicit in a cognitive structure are similar, but deal with the `plain facts of the matter' rather than the directed interpretation of an authored text.

2.2 Representing Structured Texts

Having briefly explored the complexity of a text, with its concurrent internal structures (semantic microstructures, organising macrostructures, presentational superstructures), let us consider how this complex information can be represented on a computer. There have been various schemes for representing texts as computer data: originally the purpose of a text was to produce a printed document, and so the first structures which were explicitly coded were those of physical rendering for display. We first look at these methods, and how they have developed to allow coding of the abstract structures of the content and rhetoric of text.

2.2.1 Physical Representations

Markup was the process of marking a manuscript with instructions to the human compositor for rendering the manuscript in print. When the manuscript became a computer file and the compositor a computer program, markup instead comprised the codes inserted into the text to control the composition program. These codes may be explicitly inserted by the author (in the case of typesetting systems like LATEX [85] or troff [8]) or added `behind the scenes' as a consequence of the author choosing a particular style from a menu (in word processor systems like Microsoft Word [101] or WordPerfect). These codes control printed (physical) attributes of the document, such as the fonts and spaces used to render the text (Figure 2.1a), and mimic the pre-existing technology used by printers, so the model which both of the above kinds of program manipulate is that of a book, magazine, memo, or letter--i.e. any printed item.

.ce 1
.ft B
A title, a title, my kingdom for a title
.ft R
.sp 0.5i
In this chapter we look at the possible

Figure 2.1a: Nroff physical markup for a section heading

.H 1 "A title, a title, my kingdom for a title"
In this chapter we look at the possible

Figure 2.1b: Nroff mm logical markup for a section heading

Because of the tedious and repetitive nature of this `physically oriented' low-level typographic manipulation, markup languages adopt procedural abstractions (Figure 2.1b) which mirror higher-level physical document constructs like display paragraphs, hanging indents, bulleted lists and headings, and reflect a document's logical or abstract composition, such as its construction from chapters, sections and subsections, figures and tables. Emphasised text is no longer marked up with prescriptive physical commands to "change the font to italic", but with an abstract declaration that "the following text is emphasised". It is now the responsibility of the composition program to know how to suitably render emphasised text--in other words its role has expanded from dealing with page imaging semantics to dealing with document semantics. The advantage of this style of markup is that the author can concentrate on expressing ideas within an appropriate logical framework without worrying about issues of presentation that the compositor should deal with.

Markup systems which adhere to this philosophy (troff+mm, LATEX, GML) emphasise the logical nature of their markup, especially the facilities for expressing the document's overall hierarchical structure. However a closer inspection reveals that such markup is still implicitly tied to describing the physical layout of a printed document. In fact mm and LATEX markup for a `section' or `chapter' is defined in terms of lower-level primitives for changing fonts and leaving vertical space, and so is still a physically-oriented markup, rather than a truly logical one. Document structures like `sections' are catered for in name only; in fact one is really marking up the section heading alone. Another apparently separate component of the logical document structure, the footnote, only has any meaning in a paginated environment and may need to be re-interpreted as a marginal paragraph or an endnote in an on-line text presentation system.

2.2.2 Logical Representations

Logical markup has been taken to its (logical) extreme by SGML, the Standard Generalized Markup Language [7, 76], which defines a regime for document markup without any predefined processing operations or any built-in document structure semantics. This lack of built-in semantics is both SGML's greatest strength and greatest weakness. The strength is that SGML forces abstraction from the eventual document delivery medium and allows a content-based approach, imposing a cognitive discipline that brings benefits beyond any immediate publishing requirements. The weakness is that, having no fixed semantics, SGML can equally be used to code all the different categories of logical structure: semantic structure vs presentational structure, microstructure vs macrostructure.

SGML specifies each document architecture with a DTD (Document Type Definition) defining the hierarchy of structures which may compose the document. This architecture may be used by an interactive document editor to check the structure of the document being created, or by a document formatter to process the entire document. There is a strict syntax associated with the architecture which may be understood and verified by any SGML-compliant application, but each application is responsible for interpreting the meaning of the document structure, according to its requirements.

For example, figure 2.2 shows how a biographical dictionary may be marked up. To produce a printed document it may only be necessary to specially highlight the start of the entry, the name and dates of birth and death of the individual. The other tags may be completely ignored during formatting, with the text set as if they were not there. However, when forming a biographical database from the same document it may be deemed important to identify all the information marked above so that the database can be used to determine everyone who was educated at a particular university. Without the extra markup it would be impossible to pick out these details that make the data useful for many purposes apart from printing. The markup makes explicit the information that is embedded within the text, and this information can subsequently be reused in different ways.

<entry>

<biographand><name>John Smith</></>

<dob><day>12<month>June<year>194</dob>

<dod><day>1<month>Feb<year>1987</dod>

John was born in <birthplace><place>Edinburgh</birthplace> and studied <subject>English</> at <education><place>Southampton</> University</>, graduating in <graduation><date><yr>1956</graduation>. He married <spouse><name>Emma Jones</name></spouse> in 1962 and became <profession>MP</> for Southampton in 1975 until his death.

</entry>

Figure 2.2: SGML markup for an entry in a biographical dictionary
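The reuse described above can be sketched in a few lines of Python (this is an illustration, not LACE or any SGML tool; the tag names follow figure 2.2, the regular expressions are simplistic, and the abbreviated `</>' end-tags are assumed to have been expanded):

```python
# Treating the marked-up dictionary as a database: find every biographand
# educated at a given university. Sample entries are invented.
import re

entries = [
    "<entry><name>John Smith</name>"
    "<education><place>Southampton</place> University</education></entry>",
    "<entry><name>Mary Brown</name>"
    "<education><place>Edinburgh</place> University</education></entry>",
]

def educated_at(place, docs):
    """Return the names from entries whose <education> mentions the place."""
    result = []
    for doc in docs:
        name = re.search(r"<name>(.*?)</name>", doc)
        edu = re.search(r"<education>.*?<place>(.*?)</place>", doc)
        if name and edu and edu.group(1) == place:
            result.append(name.group(1))
    return result

print(educated_at("Southampton", entries))  # ['John Smith']
```

Without the `<education>' and `<place>' tags this query would be impossible: the markup is precisely what makes the text usable as data.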

According to [58] any document can be considered to have three parallel structures associated with it. These are the abstract representation, which is concerned with the logical structure of the information contained in a document and made explicit by some form of high-level markup; the physical representation, which is determined by a formatting process; and the page representation, which is defined by a viewing process. The physical representation corresponds to the document formatted for output on an infinitely long scroll, whereas the page representation is concerned with how the formatted representation can be mapped onto discrete pages. We have already seen that there are in fact a number of different structures which comprise the abstract document. The biographical dictionary example in figure 2.2 has a very fixed database-type micro-structure. The dictionary simply consists of an ordered list of entries, with no combining of data between entries into any higher level structure. In a report document, conversely, the superstructure would be combined with content-based macro-structures but probably without the detailed exposition of the microstructure.

In a text-processing environment, the technical author usually manipulates the abstract document (which is a union of the content of the document and the distinguishable markup interleaved with the content) via a text editor. This is normally done by treating the markup as text and providing a standard set of text manipulation functions which apply to both the content and the markup. Alternatively the editing process may provide the author with the same set of text manipulations but treat the markup separately by graphically interpreting it through the use of indentation, whitespace and highlighting (as found in IBM's LEXX editor). In a WYSIWYG environment the author is usually directly manipulating the physical representation (the-document-as-a-scroll) with all `markup' inserted invisibly and interpreted faithfully on the screen. There is little concept of the abstract structure, although `style sheets' give the illusion that such a structure exists by allowing logical names to be associated with groups of physical formatting specifications that are applied to specific paragraphs. Some WYSIWYG systems (e.g. Microsoft WORD) give the author access to the page representation, or the option of swapping between both representations as required.

2.2.3 Representing Multimedia Documents


Figure 2.3: A multimedia document

In a modern multimedia document environment, the models manipulated by the computer programs are no longer those of traditional printing technology, with common interfaces and operational semantics. A document may consist of a collection of video sequences, audio clips and computer animations as well as text. Abstract but physically-based markup can no longer be used to define `how to' present each piece of information because there is (as yet) no standard practice to follow for presenting non-textual information. In any case instructions such as `leave 2 seconds of space and then show this video clip in the top-left corner of the screen with that text next to it' leave little room for true hypermedia which necessitates user-directed interaction.

Although physically-oriented markup has a limited role in a truly multimedia environment, markup itself has a crucial role to play in describing each different component of the document: its representation and its purpose (especially for non-textual information). The markup can therefore be used in two ways:

i) To encode or represent the various document objects themselves

ii) To describe meta-information about the objects or their intended use.

For example, figure 2.3 demonstrates a document which consists of some text, a diagram and a video sequence. Figure 2.4 shows how it might be coded according to an SGML DTD, with the document structure used as a container to hold the various encodings of the document media (text, picture and video). The contents of the text objects are coded according to the DTD, whereas the diagram object is coded according to some external scheme and the video object as a mixture of SGML markup and an external scheme. All the document objects (text, diagram and video) are, however, wrapped in SGML markup and carry special tags which give information about the objects.

<mmdoc>

<element type=text>Welcome to the Department of Electronics and Computer Science. <p>Click on the map below.</>

<element type=diag size=7x8 rendering=winmetafile>
AA145367382A5...</>

<element type=text>The Department was formed in 1990 as a merger between the Departments of Electronics (Faculty of Engineering) and Computer Science (Maths). This has resulted in a successful partnership of hardware and software expertise.</>

<element type=video size=3x3 rendering=frames>

<frame num=1 timecode=002701>AA145367382A5...

<frame num=2 timecode=002702>AA145367382A5...

<frame num=3 timecode=002703>AA145367382A5...
</>

</mmdoc>

Figure 2.4: Representing the document with SGML

2.2.4 Mapping Between Representations

The advantage of a logical style of markup is that the author can concentrate on expressing ideas within an appropriate logical framework without worrying about issues of formatting that the compositor should properly deal with. However, if the purpose of logical markup is to make explicit the author's intent, the structure of the information contained in the document or the structure of the argument being presented, then some relationship must be defined to bridge the gap with the formatting commands which were originally used to render the document in print.

Some of the issues that a document composition system must deal with in converting between logical and physical structures (legibility, readability, hyphenation, typographic design) can be seen in [126], but the nature of that relationship between these structures varies between the different systems of logical markup. For LATEX and ms/mm, which are high-level macro packages built on top of specific text formatters (TEX and troff respectively), the high level markup is simply a disguise for a sequence of physical formatting operations. Each piece of `logical markup' is actually a set of font changes and spacing commands in disguise. ODA maintains separate physical and abstract structures for a document in parallel, and provides an explicit mapping between the two structures. SGML, by contrast, deals with no concept of a physical structure, and devolves all formatting issues to separate application programs. (In fact the SGML LINK facility [29] can be used to provide stylesheet-like formatting capabilities, and the forthcoming DSSSL standard [2] defines an ODA-like mechanism for mapping SGML structure onto a physical structure.)
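The macro-package approach can be caricatured in a few lines of Python (this is an illustration, not LATEX or troff internals; the command names and the style table are entirely hypothetical): each `logical' element is simply a table entry that expands into a sequence of font and spacing commands, so changing the document style means swapping the table.

```python
# A caricature of logical markup as a disguise for physical operations:
# each logical element expands, via a pluggable style table, into
# low-level font and spacing commands.
PRINT_STYLE = {
    "section":  ["vspace 12pt", "font bold 14pt", "TEXT", "vspace 6pt"],
    "emphasis": ["font italic", "TEXT", "font roman"],
}

def expand(element, text, style):
    """Replace the logical element by its physical command sequence."""
    return [text if cmd == "TEXT" else cmd for cmd in style[element]]

print(expand("emphasis", "structure", PRINT_STYLE))
# ['font italic', 'structure', 'font roman']
```

An ODA- or DSSSL-style system makes this mapping table explicit and separate; a macro package buries it inside the markup definitions themselves.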

2.3 Hypertexts as Structured Networks

Bush's view of hypertext [31] was of an associative network in which nodes (peers of equal status) were connected by undifferentiated links. Many modern systems such as HyperCard and HyperTIES are based on this model, providing unstructured hypertext that can become very difficult to navigate around. One of the main reasons for this is the lack of information associated with each link which would otherwise enable judgements to be made about the applicability of the associated material.

Extending the hypertext model to include typed links allows a structure to be imposed on the information content. The most frequently used model for this structure is that of human memory. Bush's paper held the view that human memory worked in the same fundamentally unstructured way as the associative links provided by the Memex; however, modern theories of human cognition favour the semantic network [78], which, as a network of nodes joined by typed links, is congruent to a hypertext network with typed links. Various authors have tried to identify sets of link types sufficient for structuring a hypertext--[52] advocates the use of seven link types: being (subset relationship), showing, causing, using, having, including, and similarity. Nelson [109] provisionally suggests a large number of link types for Xanadu including correction, comment, translation, quote, expansion, suggested pathway and citation. Xanadu link types are entirely arbitrary, but in general the set of link types should be both small enough to maintain a rigid structure and large enough to be generally applicable in all situations.
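A typed-link network can be sketched very simply (an illustration only, not any system described here; the node names are invented and the link-type names are borrowed from the Xanadu list above): because each link carries a type, a reader or system can select links by type before traversing any of them.

```python
# A hypertext network as a set of (source, link-type, destination) triples;
# the link type allows filtering without traversal.
links = [
    ("memex-paper", "comment",  "critique-node"),
    ("memex-paper", "citation", "bush-1945"),
    ("memex-paper", "quote",    "as-we-may-think-extract"),
]

def links_of_type(source, link_type, network):
    """Destinations reachable from source over links of the given type."""
    return [dst for (src, t, dst) in network
            if src == source and t == link_type]

print(links_of_type("memex-paper", "citation", links))  # ['bush-1945']
```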

Whereas the links model the relationships in a semantic network, the nodes of the hypertext are the information content proper. For a semantic network, each node has a fine information granularity, dealing in individual concepts or propositions, i.e. textual micro-structures where each node is a self-contained entity and does not rely on a global context for its meaning. Even at this level the hypertext network may be constructed to model one of two alternative semantic structures. It may either plot the relationship between facts and concepts in the knowledge domain or depict the cognitive structure of the expert's understanding of the field. The latter choice is seen as a powerful tool since the goal of education is to transfer the expert's cognitive structure to the novice [132].

Although the cognitive structure approach to hypertext provides a sound model of knowledge and concept, and is a popular model for implementing hypertext systems [77, 81, 84], it has no model of the text itself. Begeman & Conklin document the difficulty of comprehension for readers in such a hypertext environment, since however clear the concepts in each individual node are it becomes impossible to track the thread of these concepts through many linked nodes [10]. The author is limited to expressing ideas in fine-grained, distinct units which obscure the overall development of any larger ideas, i.e. there are no macro- or super-structures. Traditional text with its linear form allows for the development of ideas and arguments building on the successive disclosure of individual points and concepts. In such a medium there is an evolving context that can be built on: in a hypertext medium there is no such context as there is no `correct' path through the network, and no guarantee that the reader will have encountered a particular piece of information.

Hence, in a cognitive structure there can be no larger theme or overall point of view which directs the authoring process, as the semantic network deals only with atomic facts or propositions and their interrelationships. One of the first rules of writing taught at school is that a text should have a beginning, a middle and an end. This is not the case in a cognitively structured hypertext, since there is no ordering or natural sequencing imposed upon the information.

2.3.1 Implementation Requirements for Structure

Systems such as HyperCard do not implement links as first-class objects, nor allow them to be visualised or directly manipulated. Instead, links are embedded in the document as part of a sequence of instructions to be triggered by some action of the reader. As a consequence, only nodes have any concrete representation--the links (and the network they specify) have only an abstract existence as a `flow of control' of the computational process which embodies the hypertext browser program.

Since links (as opposed to buttons which anchor the links to a position on the computer screen) have no independent existence, no extra information can be attached to them in the form of names, types or attributes. It is also not possible to provide a graphical browser to aid navigation through the network (this statement refers to the `plain' package--since HyperCard contains a powerful programming language it is possible to emulate more sophisticated hypertext features).

Other systems (such as Intermedia) do implement links as first-class objects, and allow direct manipulation of nodes, links and the network as a whole. These implementations also provide for extra information (in the form of tables of attributes) to be attached to nodes and links for use in the navigation process, allowing the reader (or the system) to filter links according to some criterion of relevancy.

As an example, the gIBIS system [10] has a small number of node and link types that allow the reader to differentiate between the kinds of information that are connected to any particular node. When viewing a node the reader may choose to follow up other information which supports, contradicts or in some other way relates to the current issue. All this information is deducible from the type of the individual links, and allows the reader to make decisions about the linked material without traversing the link. A similar phenomenon is demonstrated by McAleese's [51] method of qualifying citations for traditional texts. By providing extra information about the type of the citation readers can judge whether the reference is relevant to them.

Structure in a hypertext network is the organising principle that determines how the individual nodes are arranged and related to each other through the links. Typed links provide a useful means of structuring the hypertext network by placing some organising criteria on the information accessible from each node in the network. Such an organisation of the relationship between nodes in the network may, according to the choice of link types, reflect the relationship between the propositions in the knowledge domain, the organizational structure of the author's understanding, or a higher level text-oriented structure.

The choice between various kinds of structure as a basis for a hypertext model has consequences for the users of the hypertext. We have seen that the primitive associative-semantic structure provides no context and provides certain difficulties for effective use by readers. In the next chapter we look at other structuring mechanisms which provide better navigation facilities.

3. HYPERTEXT IN THE SMALL

From texts we turn to consider hypertext systems. "Hypertext in the small" is hypertext concerned with a local context--usually a single text or group of intimately related texts. It has been shown in the previous chapter that texts are themselves complex structures of related components. This chapter introduces a hypertext system, LACE, which is based on a model of structured texts and creates hypertexts from existing structured documents. It then goes on to examine the use of structure both as a reader's navigation tool and an author's construction tool, and introduces a system called LACE'92 which is used to help author structured documents. The chapter closes by looking at intentional authored structure versus non-intentional evolutionary structures.

3.1 LACE

The previous chapter has examined the nature of `text' and the various structures associated with it. It also demonstrated the difficulties of hypertexts which are structured too closely around the atomic facts of the knowledge domain. In fact, most authors are producing not hypertexts, but documents in printed or electronic form as reports, letters, or books. Modern document composition systems deal with structured documents whose contents can be rendered in different styles (LATEX, SGML, ODA, Word). This chapter describes a system which allows hypertexts to be built from these existing documents.

In [40] Cole & Brown, remarking upon the similarities between paper documents and hypertexts, state

"it also seems sensible to make provision for readers to have the advantage of hypertext navigation when viewing a document on screen, even if the document is eventually intended to be read from paper. These aims could best be achieved by having a common underlying representation for the structures of both types of document, together with well-defined ways of mapping these structures into different forms of representation... it is not suggested, of course, that a document designed for paper would necessarily make a good hypertext or vice versa, only that a usable representation should be readily available by applying different presentation styles."

Here we describe LACE [38, 118], a hypertext presentation environment built on the LATEX document production system. LACE turns each document into a database of components which can be individually addressed. A document can be viewed as a contiguous whole or have isolated components extracted. The logical structure of the document is important because it provides both coherence to the document as a whole and a mechanism for deconstructing the document.

LACE addresses the goal of automatic hypertext generation by producing a hypertext from the original sources used to create a paper document. LACE's hypertext viewer uses a document's explicit structural information (chapters, sections, floats, marginalia) and existing navigation structures (table of contents, index, citations).

The advantage that the Memex had was a huge increase in speed and convenience for the user--no more walking through miles of bookshelves or flicking through hundreds of pages to locate a single piece of information. Instead, everything was to be available through the motion of a lever. One of the appealing characteristics of the Memex was that it gave the reader just what they were used to: complete pages of text (lots of information available at a glance) which were designed especially to make the reading process easier. Use was made of both horizontal and vertical white space to set off important information and to help divide the text into units of paragraphs, sections and chapters. Different styles of letter-shape were used to further highlight and draw attention to important information. All this was available because the Memex gave a photographic reproduction of the original texts at their original size.

Limitations of size and layout hinder many hypertext systems from fulfilling the fundamental goal of hypertext. Small fixed node sizes force the author to break material into unnatural chunks thus hiding information from the reader, and the lack of typographic devices means that the reader finds it harder to locate information. Typically the author is forced to use excessive spacing in the layout to force items to stand out which in turn exacerbates the problem of size.

A further limitation is the insular nature of these systems--documents have to be explicitly authored within the system and can only be read using it. There is a need for open standards of access when building up a network database of literature so that one system may act as a hypertext viewer to a set of documents, while another can perform textual criticisms of the documents' contents and yet another may perform automatic content analysis. A great deal of effort must be put into creating an online literature database; it is imperative that the result is easily extensible and reusable.

Centuries of use of paper-based books and journals have led to many developments in their presentation which have enhanced the way in which we extract information from the page. The techniques of typography and layout to which we have already alluded, footnotes, indexes, tables of contents, citations and bibliographies all help us to navigate through printed material. Even physical attributes of the document (such as the relative thickness of the document) provide navigational clues when we read [11]. It is important that problems associated with readability and the convenience of the user-interface are solved, otherwise that which we call hypertext (the prefix indicating facilities in excess of a normal text) actually becomes hypotext: substandard literature which is no longer as useful as its original paper form.

LACE was conceived as a solution to some of these limitations. Instead of a fixed array of characters, LACE supports typeset pages, with different font styles and sizes used as they would be in a printed document. A page allows more information to be presented to the user, so reducing the problem of information fragmentation. LACE avoids insularity by using documents that conform to various common generic markup schemes (LATEX, WEB and troff's man macros) allowing them to be viewed in a hypertext environment or formatted for printing without modification.

In the following sections LACE will be explained according to the various concerns of a hypertext system: how to represent, store and retrieve documents (back end issues); what facilities to provide for browsing published documents and navigating through the body of published works (reader's front end); and what facilities to provide for creating documents and for linking them into the existing body of literature (author's front end).

3.1.1 Document Representation

LACE deals with documents which are expressed in some form of logical markup (commonly LATEX). This section discusses the reasons behind choosing explicit markup over a direct-manipulation WYSIWYG model and the choice of a particular markup system. Although users of other hypertext systems (such as Intermedia) directly manipulate a document's physical representation, logical markup and an abstract document model were chosen for the following reasons:

logical markup allows authors to make their intentions explicit

logical markup allows a document to be `ported' easily between diverse applications and systems

logical markup is very commonly used within the academic community

It is the first reason which is in fact the touchstone of LACE's approach to documents. An academic document often takes the form of a reasoned argument, and an argument involves a sequential development of points and a hierarchy of ideas and information which support and contribute to the main thesis. In a similar fashion a technical document frequently follows the structure of lexical taxonomy, in which the discourse proceeds from a general class to the subclasses and their particulars. The structure of such documents allows the reader to understand the contribution that a particular statement makes to the overall argument or theme and to make relationships between ideas that are being developed and ideas that have been previously established. If the structure is not stated clearly enough then readers are left to their own devices to make decisions about the function of a new piece of information, whether it is a subsidiary point of a previous topic or whether it stands alone as a major item in its own right, leading to ambiguity and confusion. Tyler [140] argues that the understanding of a document presupposes that the text as a whole is composed of a hierarchy of parts, and that comprehension of the text comes from construing those parts. The structure with which logical markup systems usually deal is a mixture of the macro-structures and superstructures of discourse analysis--the sections, subsections and sub-subsections are often used to dissect the subject-specific content, while their agglomeration into chapters and complete documents is controlled by the genre's (implicit) superstructure.

The use of logical markup which mirrors the logical structure of a document (structural markup) reinforces the semantics of the text that is being created. This is often helpful to the author as it coincides with the process of creating an outline of the document which both directs and constrains the authoring process. It is also helpful to the reader because the formatting process may use the structural semantics to provide visual clues to aid comprehension of the flow of the text (for example, the titles of key points may be emphasised by representing them in a bold font while subordinate information may be separated from major points by extra vertical and horizontal space). It is also useful in a hypertext environment, because it makes explicit both the division of information amongst separate nodes and an initial set of links that can be established between those nodes--in effect providing a method of automating the production of a hypertext network from a `flat' document.
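This automation can be sketched in a few lines of Python (an illustration only, not LACE code; the `\section' recognition is greatly simplified and the `next' link type is hypothetical): the sectioning commands of a flat, LATEX-like document yield both the nodes of a hypertext and an initial set of links between them.

```python
# Deriving nodes and links from the structural markup of a flat document.
import re

source = r"""
\section{Markup}
Text about markup.
\section{Hypertext}
Text about hypertext.
"""

def nodes_and_links(text):
    """Each section becomes a node; successive sections are joined by links."""
    titles = re.findall(r"\\section\{(.*?)\}", text)
    links = [(a, "next", b) for a, b in zip(titles, titles[1:])]
    return titles, links

nodes, links = nodes_and_links(source)
print(nodes)  # ['Markup', 'Hypertext']
print(links)  # [('Markup', 'next', 'Hypertext')]
```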

SGML would seem to be the natural markup scheme for use with LACE, but LATEX [85] and troff's man package were chosen instead because of their widespread use within the academic community. In both cases the markup is actually implemented by a programming language embedded in a lower-level formatting engine (see [8] and [85] for full descriptions). Of the two, LATEX is more widely used as it has been ported to all major computers and is compatible with all major printer types, whereas troff is mainly available on UNIX systems. LATEX is also the more flexible since it defines a one-to-many mapping between the abstract and physical structures by use of different document styles. Support has been provided for both troff's man macros and LATEX, although the description of LACE that follows will assume the latter.

As well as reinforcing the semantics of the text, structural markup frees the author from making decisions about the visual design of the document. This is especially important when the physical representation of the information may change radically for each different publication medium (computer screen, low fidelity computer printout or book). One of the goals of the LACE project is the reusability of literature. It is important that documents authored for one system are capable of being used in another, making the use of a text-based interchange format appropriate. LATEX allows the author to `plug in' different document styles to radically alter the physical representation of the document. LACE extends this capability by defining a hypertext style that formats the elements of the document's structure for display as nodes in a hypertext network.

3.1.2 Document Storage and Retrieval

In the `real world', literature is composed of published `works' in the form of novels, plays, manuals, reports and the like. LACE deals with these works, or `documents', which have to go through a publishing process before they are available to anyone other than the author. Both in the real world and in LACE, each document is in some sense complete in itself, containing all the information required for a particular purpose: a novel is published as a single entity, not as a separate set of chapters, although hypertext systems with fixed-sized nodes often force this sort of behaviour.

The publishing process which makes a document accessible to the world at large involves entering it into a host-wide database of documents presided over by the document librarian, also known as the lace dæmon. The database holds information about the document such as its title, keywords that sum up its contents, its access permissions, its type (video, LATEX, WEB or man) and its location in the filestore.

In LACE the structural elements of the document are the nodes of the network--each element of the document may be individually addressed by naming its position within the structure and the title or number associated with it (e.g. `abstract', `chapter 1', `section Troubleshooting and Diagnostics' or `table 3.5'). As well as this, each document is published by entering its details into a site-wide database. A librarian process listens for requests for individual elements from particular documents, and displays them in an appropriate fashion on the console.

Each individual document is itself a database--a database of logical elements (in the logical markup sense) and the relationships between them. The document librarian is a process that listens over the network for requests of the form Far From The Madding Crowd:chapter 4. The librarian then looks in its database to find the document whose title is Far From the Madding Crowd, checks that it has public access, and works out where it is stored. The document is then inspected, and chapter 4 is extracted from it and sent over the network to the user who requested it. Each document also has a short nick-name stored in the database along with the title, so that requests don't become cumbersome to type!
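The librarian's resolution of such a request can be sketched as follows (an illustration only, not the librarian's actual implementation; the database fields, the sample document and the extracted text are invented):

```python
# Resolving a "title:element" request against the librarian's database,
# honouring nick-names and access permissions.
database = {
    "Far From The Madding Crowd": {
        "nickname": "madding",
        "access": "public",
        "elements": {"chapter 4": "text of chapter 4"},
    },
}

def resolve(request, db):
    """Return the named element of the named document, or None."""
    title, element = request.split(":", 1)
    for name, doc in db.items():
        if title in (name, doc["nickname"]):
            if doc["access"] != "public":
                return None
            return doc["elements"].get(element)
    return None

print(resolve("madding:chapter 4", database))  # text of chapter 4
```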

The structure-based addressing scheme acts to unify various document types. Videodisks are published divided into chapters which can be logically subdivided into different video sequences or `sections'. This allows links to be made to various media: the librarian process is responsible for displaying the various data types in an appropriate fashion (a window on the console, a sequence on a separate video display monitor or an audio sequence from a CD player). In this sense, LACE is a hybrid hypermedia system, as explained in section 1.2.1, as it allows simple access to information in different media.

3.1.3 Creating LACE Documents

Authoring a document is straightforward since there is (or should be) no difference between creating a document for paper and creating a document for hypertext browsing. The document librarian provides LACE's low-level machinery, but access to that machinery is largely hidden from the author. This is a necessary part of the LACE philosophy, since its aim is to automatically convert existing literature into a hypertext format. In fact, just as LATEX aims to provide different document styles to automatically change a document's printed representation between that of a journal article, report, thesis and book, so LACE aims to provide a new document style that will change the document's representation into a hypertext network.

LACE uses all the facilities provided by the markup scheme to present the document to the reader in a useful way. For example, the author's use of chapters, sections and subsections not only allows the hypertext machinery to request each particular element, but also provides a table of contents with buttons to call up the sections automatically. This information is also provided as a menu and in pictorial form as a tree.

To publish a paper, it must first of all be converted into a form that the LACE librarian will accept. For LATEX documents this involves adding the lace documentstyle option at the head of the document and running latex as normal followed by hyperdvi. WEB documents (from TEX's literate programming system [82]) should first be processed by tex (but should \input hyperwebmac instead of the standard webmac), and then by hyperdvi. Troff manual pages are processed by hypertroff -man. All these processes create a .ps file that will be used by the librarian, and a .ps.map file that indexes the PostScript file by the original document's logical structure.
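The conversion step for each document type can be summarised as follows. The commands themselves (latex, hyperdvi, hypertroff) are those named above, but the dispatch logic is a sketch assumed for illustration, not part of LACE.

```python
# Sketch of the document-conversion step: for each document type,
# assemble the commands (described in the text above) that produce the
# .ps and .ps.map files used by the librarian.  The dispatch table is
# an illustrative assumption.

def conversion_commands(doc, doctype):
    """Return the shell commands that convert a source document into
    the form the LACE librarian will accept."""
    if doctype == "LaTeX":
        # The source must already use the 'lace' documentstyle option.
        return [f"latex {doc}", f"hyperdvi {doc}"]
    if doctype == "WEB":
        # The WEB source should \input hyperwebmac instead of webmac.
        return [f"tex {doc}", f"hyperdvi {doc}"]
    if doctype == "man":
        return [f"hypertroff -man {doc}"]
    raise ValueError(f"unknown document type: {doctype}")
```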

Publishing consists of making an entry in the librarian's database. This is accomplished by running the command lace -n foo, replacing foo with the name of the file that holds the document. The author is then prompted for a number of pieces of information (such as keywords describing the content of the document), and the database is updated.

3.1.4 Reading LACE Documents

LACE documents are displayed using the NeWS windowing system [61], a PostScript-based window system that is proprietary to Sun Microsystems. (NeWS was later incorporated with X11 into SunOS's OpenWindows, but has now been dropped in favour of Display PostScript extensions to X11.) The command startLACE fires up the server with the extensions needed to display all the documents (since NeWS was a class-based PostScript environment, the startLACE command defines new document classes for TEX and troff documents). Typing lace help in the console window brings up a document that gives guidance for running lace and also provides buttons to reference all the documents available on this host.

At a casual glance, LACE provides a LATEX previewing facility, and treats the page as part of the logical structure. This is convenient in allowing the reader to make use of the familiar book metaphor. When a request is made for a document without specifying any substructure the librarian process returns `page 1'. The LACE menu has the usual TEX previewer functions of moving between pages as well as the hypertext capability of stepping backwards through the list of nodes seen so far. The Table of Contents, List of Tables and List of Figures, usually seen in the document's front matter, are also turned into submenus which bring up a new window containing only the appropriate document element. As well as appearing in the menu, these table structures have buttons placed over each line, so that clicking on the line in the Table of Contents which refers to section 3.3 also brings up a new window containing that node.

LACE buttons are transparent `patches' which respond to a mouseclick by sending a request for a document part to the librarian process. Visual cueing is left to the document style or author to decide, though it is anticipated that the author will never have to explicitly request a hypertext link. Instead, LACE attempts to infer as many links as possible from the structural markup. This is done not only for navigation structures like the contents list and the lists of tables and figures, but also for cross references and citations.
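The inference of links from cross-reference and citation markup might be sketched as follows. This is an illustration only: the regular expressions and data shapes are assumptions, and LACE itself worked from the typeset output and its map file rather than scanning source in this way.

```python
import re

# Illustrative sketch of link inference from structural markup:
# cross-references and citations in the source become hypertext
# buttons automatically.  The patterns and shapes are assumptions.

def infer_links(latex_source):
    """Return (kind, target) pairs for each cross-reference or
    citation found in a fragment of LaTeX source."""
    links = []
    for m in re.finditer(r"\\ref\{([^}]*)\}", latex_source):
        links.append(("xref", m.group(1)))      # e.g. 'see section 4.5'
    for m in re.finditer(r"\\cite\{([^}]*)\}", latex_source):
        links.append(("citation", m.group(1)))  # brings up the full reference
    return links
```

Each inferred pair would then be realised as an invisible button over the typeset marker, requesting the target node from the librarian when clicked.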

Some of the document's physical representation has been changed for more suitable behaviour in a hypertext environment. Document formatters make every effort to move some structures (tables, figures and footnotes) out of the main body of text as they interfere with the flow of the discourse. Usually they are `floated' to the top of a new page or to a page by themselves to minimise their interaction with the surrounding text. In a hypertext network they can be taken out of the containing node entirely. Tables, figures and footnotes are displayed by clicking on cross-references to them.


Figure 3.1: The title page of a LACE document

Figure 3.1 shows the title page of a LACE document. The window is surrounded by a frame with the title of the document, and an indication of the number of pages through the document (here on page 0 out of 26). In the top left-hand corner is the close button that shrinks the window to an icon when it is pressed with the left mouse button. In the bottom right-hand corner is the adjust button that shrinks or expands the window by clicking on it with the left mouse button and dragging the window until it has the required size. The bottom left-hand corner houses the zoom button which magnifies the size of the page by clicking on it with the left button (zoom in) or shrinks it with the right mouse button (zoom out). The upward, downward, left and right facing arrow-heads are scroll buttons that are activated by the left mouse button. To move the window, click anywhere in the frame and drag with the middle mouse button.

The menu is brought up with the right mouse button. The menu items Next Page and Previous Page advance through the document and back, one page at a time. Goto Page brings up a submenu with the page numbers that can be chosen directly, and Back returns the reader one at a time backwards through the list of pages that they have visited.

The Contents submenu lists the major document structures (exactly like a table of contents) and brings up the appropriate part in an independent window. It helps the reader to navigate quickly to the information wanted, as long as its position is known. There are also submenus for Figures and Tables, which will only appear if the document has any figures or tables in it. A new window with the appropriate figure or table is displayed when the reader makes a choice from one of these submenus. Since the object has been `floated out of the document' and away from the main flow of text the user will only see the object by using these menus, or by clicking on an explicit reference to that object.

The Add to Trail menu item puts a reference to the element currently on display into the reader's trail. A trail is simply a document which is composed of a list of references to other documents, with the effect that a list of interesting references can be saved and replayed later. An automatic trail is also kept which consists of every document element seen, whether interesting or not! The Gotos and ComeFroms items give submenus detailing the other document parts (or nodes) that are referenced by this node or that reference this node, respectively. This enables the reader to jump to any piece of information that has been marked as being relevant to this node. The last item, Zap, destroys the window.

Figure 3.2 shows the end of a (somewhat hectic) session which demonstrates many of LACE's features. The title page of a document similar to that shown in figure 3.1 is in the top left corner, as are further pages. Page 1 is shown slightly to the right, displayed in the same fashion as a printed page, but with a number of hypertext effects added, all implemented as buttons on the page. A button is an invisible patch placed over a significant word or phrase on the window. Since it is invisible, it is left to the typesetting software to mark the phrase as `active', generally by using a different font. Each document style will choose its own convention for displaying important material. When the mouse pointer is moved over a button the cursor is changed to a cross to indicate that a hypertext jump is available.


Figure 3.2: A LACE session

Buttons have been placed over each entry in the table of contents, list of tables and list of figures. Clicking on any line in the table of contents or list of tables performs a jump to that part of the document--a button has been placed around each line of the table of contents that sends a request to the LACE librarian for the specified node. Footnotes are treated similarly. Each footnote marker (usually a superscripted number or asterisk) has a button placed over it. When the button is clicked, the text of the footnote is displayed in a new window. References to other parts of the document are also given buttons: clicking on a piece of text that looks like for further details, see section 4.5 will bring up a new window with section 4.5 in it. Citations of other works are treated similarly: the citation markers (like `[8]' or `Rahtz85') are given buttons that bring up the full reference from the document's bibliography. There is currently no method for bringing up that document, even if it is held in the system.

To the right, overlapping with the first page in figure 3.2, is a window containing the Introduction section from the same document. It contains some of the text which is on the first page, but its extent is not determined by the page boundary: it contains all the text from section 1 and any enclosed subsections. The long window over the top, slightly obscured by the menu, is a footnote window, brought up by clicking the (obscured) footnote marker. Below the title page is an experimental table of contents, produced by the LATEX typesetting software as part of the LACE document style, intended to give an alternative form of graphical navigation. The other windows are section 1, a table and a figure from another document. Appendix 2.1 contains further information on the use of LACE along with details of its implementation.

3.1.5 A Comparison of LACE with other systems

LACE bears some similarities to NLS in its treatment of nodes, since it models a document as an hierarchical structure, any subtree of which is taken as a whole node. Other systems only allow an individual leaf from the tree to act as the node. For example, choosing section 3 from a LACE document will result in the entire contents of that section including any subsections, whereas the same operation in ZOG displays only the node which represents that section, the subsection nodes being addressed by further hypertext jumps.

LACE allows a large degree of access to multiple information types by virtue of its unified data addressing scheme. Many of these types of information are provided by the typesetting software (diagrams, tables, complex mathematical equations, graphs) and are really part of a fundamentally textual document. Video information truly is a different medium, although it is not currently fully integrated into the environment since it is not currently possible to have a video node (or a subpart of a video node) as the source of a link. However, the generality of the addressing scheme allows computational information types (such as dynamic references to tables in a relational database) to be intermixed with text and video.

The large screen which LACE uses to display its windows allows a large body of information to be shown to the reader at once, so the reader can work in the familiar book paradigm, using its well-known navigation and browsing techniques (tables of contents, indexes, cross references).

Links are not currently first-class objects, but are simply references to the destination address invisibly embedded in the text. However, much can be done at the time of publication, including building a list of the links into and out of each node. The role often played by a graphical browser is adequately given by the table of contents. Changing the publication process so that links are included in the document's map file is sufficient to make links into first-class objects, allowing the creation of a true graphical browser.

An advantage of LACE is that it allows reusability of each document. The same version that was produced as a paper report can be stored as part of a hypertext network because of the generic nature of the markup that is used to describe it. If hypertext is to succeed as a medium it is just this sort of integration which is necessary, otherwise a prohibitive expense will be incurred in redocumentation. Practical experience shows, however, that authors tend to write for a particular medium, and even for a particular house style, especially if they are not anticipating the reuse of their efforts. Often this involves abandoning logical markup and mixing text formatting commands with the logical markup to gain a specific layout effect. Unfortunately, this means that it is frequently necessary to alter (although one might say "improve") the document source to produce a suitable hypertext conversion.

A disadvantage of LACE is that it is based on a non-WYSIWYG process, i.e. there are two versions of the document: the one that the author wrote (the LATEX document) and the PostScript version that the typesetting software created. Because of this it is difficult to make any changes to a document, including adding new links, without resorting to a `recompilation' process. Since the typesetting software also destroys the sense of the document (turning it into a list of characters and white spaces) it is difficult to provide dynamic querying of the document. LACE makes up for this deficiency by searching the source of the document and returning the appropriate node from the compiled version.

Acrobat, a more recent document environment than LACE, turns each document into a database of objects which can be individually addressed. Unlike LACE, the document is structured not according to its logical content, but according to its physical formatting characteristics. Acrobat does not attempt to automatically generate hypertext from an existing document; instead it `normalises' the document's formatted representation and provides a viewer which can add hypertext links and annotations to the document. Both DynaText [46] and Grif [116] (also more recent systems) are similar in concept to LACE, but use SGML-style markup instead of LATEX. DynaText directly interprets the SGML markup to produce a formatted physical structure, and so does not suffer from the disadvantages of LACE's compilation process. DynaText also provides full-text indexing of its documents, whereas LACE only deals with a document's gross structural elements.

3.2 Document Structure as a User's Tool

LACE takes advantage of the internal structure of a document as a navigation facility for readers, but beyond that, structure can also be treated as a construction tool for authors. In this section we further consider systems that take advantage of a document's structure.

3.2.1 Structure as a Tool for Navigation

Cole and Brown [40] are not alone in commenting on the similarities between paper documents and hypertexts. Furuta, Plaisant and Shneiderman [59] describe the conversion of several kinds of documents into hypertexts. Raymond & Tompa [119] describe the production of book and hypertext versions of the Oxford English Dictionary from a single (SGML) source. Others have argued that hypertexts should conform to the same user-interface and navigation paradigms as books [11, 74].

The end result of the authoring process for a paper document is not a database of facts but a discourse or directed presentation whose structure is a part of the meaning of the discourse [135]. LACE exploits this explicit structure of presentation as a navigation tool for browsing the hypertext. Grif [116] and DynaText [46] also perform a similar function for the structure of documents expressed in SGML.

Nanard and Nanard [106] also work with a model of structured documents, but extend it with parallel structures representing knowledge-domain concepts and task-domain concepts. Explicit relationships (or links) are set up by the author between the various layers of structure (from the document to the knowledge base, or from the knowledge base to the task descriptions), so that the system can use these different information sources to improve the reader's navigation around the document layer. In this system, document structure is augmented by a content-based knowledge structure as a navigational utility, and can be used to generate new views of the original documents (synthesised documents). See section 3.2.4 for a description of a system which provides this kind of layering of structures to aid authoring.

Others have also seen structure as a means of navigating a hypertext, even when structure is not explicitly present in the network. Such post-hoc structuring techniques are explored in [18], where graph-theoretical algorithms are used to identify abstractions such as aggregate concepts within a network. Another post-hoc method of imposing structure on an existing hypertext to aid readers is given in [94], where implicit structures are derived from the spatial relationships of a displayed hypertext network. Salton uses textual comparisons to identify similarities between groups of texts, and thus creates a linked text structure to allow navigation [127].

3.2.2 Structure as a Tool for Authoring

A lesson that Brown distils from the development of Guide is that it is important not to set out to build a hypertext system, rather to set out to build a system to help authors and readers of online documents [26]. This section describes the development of a prototype system (LACE-92) which is intended to extend the focus of LACE into the authoring stage of a document's life-cycle, and then describes the authoring support provided by other systems.

Hypertext studies have looked at the mechanisms of joining and intertwining information units, the mechanics of hypertext jumps, the technology which supports reading, and analyses of comprehension. What is frequently lacking is the context of a complete document life-cycle in which to fit the finished hypertext: how is information presented for reading, and how is a document composed from the raw information?

Structure can be used not only to present pre-existing material but also to direct and constrain the authoring process. This section describes an extension to LACE to encompass the authoring stage of the document life-cycle by providing a simple model of the authoring process.

The previous description of LACE has shown how a document's logical structure is useful for imposing a hypertext presentation upon it and why that should be so: a document is created and shaped according to the rules governed by its superstructure. This section covers a model for authoring new material in a hypertext environment. LACE utilised "off-the-peg" literature, while other systems have assumed that pre-written (or pre-planned) material is to be imported chunk-by-chunk [72] in an author-unfriendly fashion. Hutchings describes in chapter 8 of [72] the necessity for a complete storyboarding process to take place away from the hypertext system in order to establish the required structure and contents without confusion. Most realistic writing assignments, however, involve an author starting from scratch without the benefit of a file containing the contents of the finished assignment.

LACE-92 is a prototype system which implements a simple model of authorship in which an author first of all researches (i.e. searches for relevant ideas and information), then chooses the information appropriate to the task, then organises the information within an informal structure. This structure develops iteratively into the logical structure of the authored document, and is used to create a new LATEX or SGML document containing the references and quotations chosen in the earlier stages.

3.2.2.1 The Author as a Reader

When presented with an assignment the author starts work as a reader, searching the literature for all relevant work. This may not just occur as the initial phase of an individual project but may also continue through all following phases of the authoring process. In a hypertext environment this means that all previous works must be easily available, with good browsing facilities and sophisticated information retrieval facilities to be able to deliver a reasonably comprehensive list of relevant material. The phrase "previous works" should include not only carefully edited, published online works, but also more haphazard sources such as personal notes, E-mail conversations and bulletin board messages.

3.2.2.2 The Author as a Censor

Having obtained a body of material (documents or raw facts and statistics), the author must decide which items are truly relevant, which are relevant but uninteresting, and even those which are relevant but "wrong". This process is driven by the nature of the final work and its identified macrostructure (anthology, report, summary, experimental report), the desired conclusion, and the intended audience.

This model assumes that the author is provided with more than enough source material (or information) to produce the finished work, and places the author in the role of a sculptor, chiselling away unwanted material to reveal a work of art beneath.

3.2.2.3 The Author as an Organiser

Having obtained a pertinent set of source documentation, the author must choose how to present this information to the audience. This process consists of two activities: partitioning, in which the common themes are identified and sources grouped according to this thematic classification; and sorting, in which items of a particular theme have a partial ordering imposed on them. This ordering is due to the constraints of the document (whatever the medium, a reader can only follow one trail at a time) and the constraints of the document's superstructure. For example, if several points are to be presented, it may be necessary to present one before the points which support or refute it. These two processes are iterative, since each theme may be divided into subthemes requiring more partitioning and sorting.
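The two activities can be sketched as follows. This is a minimal illustration of the model only; the data shapes and the theme and ordering functions are assumptions, not part of any LACE implementation.

```python
# Illustrative sketch of the partition-and-sort model: sources are
# first grouped by theme, then ordered within each theme.

def partition(sources, theme_of):
    """Group sources by the theme assigned to each of them."""
    themes = {}
    for s in sources:
        themes.setdefault(theme_of(s), []).append(s)
    return themes

def organise(sources, theme_of, order_key):
    """Partition the sources by theme, then impose an ordering within
    each theme (e.g. a point before the points that support it)."""
    return {theme: sorted(items, key=order_key)
            for theme, items in partition(sources, theme_of).items()}
```

In practice the processes iterate: any theme produced by `partition` may itself be partitioned and sorted into subthemes.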

3.2.2.4 Overview of LACE-92

LACE, as has been mentioned previously, used pre-authored literature and presented it in a hypertext environment using information on its internal structure to produce a degree of automated link generation. The authoring had to be done (unsupported) using a text editor, but the goal of LACE-92 is to provide as much support to the author as possible, and to ease structured hypertext authoring.

LACE-92 uses WAIS information retrieval techniques (see section 4.2 for a full discussion of WAIS) to provide the author's set of source documents from network-based information servers. Much of the material would be from pre-published works, and may be in a structured, hypertext format, allowing the chosen sources to act as transclusions or links back to their native hypertext networks. In this way, a new document will automatically be linked into the existing literature from the very start.

The partitioning and sorting described above proceeds on a network diagram, where the overall structure of the network is constrained by the document's chosen superstructure. When these processes have run to completion, the author has produced a skeleton of the finished document. It contains a framework consisting of the points and themes which are to be presented, with the sources placed appropriately. The framework is fully linked with the source literature, and, by virtue of the chosen document structure is also internally linked. The network browser could also facilitate the production of high level summaries, automated tables of contents, citations, references, footnotes, glossaries and other physical document structures.

What goes into the network is mainly "information" and "quotations" which are linked back to the original sources. However the final stage of production is a written document, not a handful of transcluded texts and citations, so it is necessary to do some real writing! This may be achieved in the browser, but the whole framework can instead be exported as an SGML or LATEX document for editing in a conventional environment.

3.2.2.5 Using LACE-92


Figure 3.3: The WAIS information-gatherer component of Lace'92.

LACE-92 exists as a prototype implemented in HyperCard, and is based around two HyperCard stacks. One (see figure 3.3) is for information retrieval using the WAIS protocols and a WAIS-like interface and the other (see figure 3.4) for manipulating the relevant retrieved information.

According to the simple authoring model of LACE-92, the author starts with a writing task which involves selecting relevant material from a library of information. This is done by choosing from among a menu of information servers, each typically concerned with a single topic. The field at the top left of the WAIS window in figure 3.3 shows that the user has chosen the database of CACM articles at the Internet site quake.think.com (these are a set of articles on the subject of hypertext which were originally published in [1]). The field at the top right of the window shows that the user has asked for articles about "hypertext" and the scrolling field in the middle of the screen shows the list of articles returned by this query. The user clicks on one of the lines to retrieve the full article which is then displayed in the bottom field.


Figure 3.4: The information-organiser component of Lace'92.

The user may select any text which is relevant to the task in hand, and by dragging it offscreen can have it placed in a new field in the organising window (figure 3.4). Along with the selected text, the system stores the details of the chosen article's remote storage address, the server it was obtained from and the offset of the selection from the start of the article. As the user builds up more and more of these selections they can be easily moved around the screen to reflect some form of incremental organisation. Each text selection can have its enclosing field moved, resized or deleted, and the text can be reformatted with single keypresses.

Once a collection of document fragments has been retrieved it is partitioned and sorted by laying the information out as an (informal) network. The network structure of the information is expressed by its position on the screen, with similarly relevant pieces of information being clustered together (this use of spatial layout is also described as a way of creating structure in [92]). To alleviate clutter on small screens, it is possible to select the nodes in a cluster and have them replaced by a named aggregate node which expands to a new (subordinate) network diagram by choosing New Partition from the Lace '92 menu. The network of document fragments should be created according to the required document superstructure, although this is not enforced. The collection, sorting and partitioning phases may go on in parallel and may be iterated many times, but once the user is ready to proceed with the writing phase of the task (combining the collected quotations and evidence into an original work) then the current state of the document can be dumped (in SGML or LATEX format) to a text file by choosing Export Structure from the Lace '92 menu. Not only is the partitioning reflected in the heading structure of the file, but also each quotation is recorded with its reference from the WAIS retrieval, so that this new piece of writing is created with `links' back into a wider body of literature.
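The Export Structure operation described above might be sketched as follows. The data shapes (partitions as titled lists of fragments, each fragment carrying its WAIS retrieval reference) are assumptions for illustration; the prototype's actual HyperCard implementation differed.

```python
# Illustrative sketch of Export Structure: the partitioned network is
# dumped as a LaTeX skeleton, each partition becoming a heading and
# each quotation recorded with a reference from its WAIS retrieval,
# so the new document carries `links' back into the wider literature.

def export_latex(partitions):
    """partitions: list of (title, fragments); each fragment is a
    (text, source_reference) pair from the WAIS retrieval stage."""
    lines = []
    for title, fragments in partitions:
        lines.append(f"\\section{{{title}}}")
        for text, ref in fragments:
            lines.append(f"\\begin{{quotation}}{text}\\end{{quotation}}")
            lines.append(f"% source: {ref}")
    return "\n".join(lines)
```

An SGML export would follow the same shape, emitting the partition structure as nested elements rather than sectioning commands.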

3.2.3 Related Authoring Support Work

This section describes a number of systems that have been designed from scratch to provide authoring support and some existing systems which have been retrospectively enhanced to provide better support for authors.

Marshall & Shipman [94] point out that although unconstrained hypertexts are the norm, embedded constraints can aid the coherence and consistency of a network. Many systems which allow unconstrained hypertext support the author by making it easier to create consistent structures which are constrained according to some chosen type or model. These constraints may take the form of standard structural components which can be `plugged' together to produce the overall network. Both NoteCards and Intermedia developers have recognised the need for this kind of author support. The NoteCards approach is the Instructional Design Environment [79] which provides structure accelerators, in the form of template cards (containing prototype text and links for various styles of card), automatic links (which create a new anchor, link and card of the appropriate type in one operation) and a structure library (which is a global store of named, contentless hypertext structures taken from specific instantiations of networks of cards). Intermedia provides Hypermedia Templates [39] which are a combination of IDE's template cards and structure library, containing both node prototype contents and inter-node links.

Both the Instructional Design Environment and the Hypermedia Templates provide "fill-in-the-blanks" support for authors to be able to rapidly create larger hypertext networks, but provide no support for the cognitive disciplines involved in writing (creating the node content). The cognitive overhead required for producing the hypertext form is lessened, but support is given not for what to say or how to say it but only where to say it.

Various authors have proposed hypertext models containing composite nodes, i.e. nodes which `contain' subnetworks of linked nodes [49, 91, 130]. These models help authors to compose large complex artefacts from smaller subgraphs, also resulting in improved human comprehension of the network structure. The HyDesign model [91] also helps the author to structure the network by providing aggregate links for sequence, hierarchy and group abstractions. De Bra et al [49] introduce other network abstractions: the tower (for representing multiple levels of description of a concept) and the city (for multiple views and perspectives of a concept).

A number of other hypertext systems have been designed to allow explicit representation of structure. Nanard & Nanard describe in [105] their use of MacWeb to allow the description and capture of knowledge-domain information. The MacWeb hypertext kernel provides weakly typed nodes and links; the application built upon it makes use of a separate hypertext network to define relationships between the different link and node types which are to be used in the main hypertext. It is this augmented type structure which is used as a basis of building the hypertext according to a specific pattern of knowledge elicitation.

Kaindl and Snaprud [80] make it explicit that there are two quite separate structures which the author needs to attend to: the structure of the text and the structure of the underlying knowledge. Whereas IDE, Hypermedia Templates and the Nanards' system attempt to structure the text according to (gross) relationships in the knowledge domain, Kaindl presents a mechanism for unifying the two structures in which the hypertext is implemented by a system of frames inside a knowledge representation tool. The rules of the system ensure a close match between the structure of the hypertext and knowledge domains, with appropriate links automatically managed between the corresponding nodes.

Aquanet was designed to allow authors to express structured relationships between hypertext nodes, but experiences from its use [93] show that authors do not rely on a predefined library of network structures. Instead they try to define their own schemas for hypertext structures, often without a full understanding of those structures, and are consequently frustrated by a system unable to support flexible schema modification. Similar problems of premature commitment were also seen among users of Notecards [93] and even users of non-hypertext structured document editors [22]. Aquanet users circumvented this problem by using the main display space as a drawing board, expressing developing relationships between objects as analogous spatial relationships between the object icons (e.g. similar objects may be placed in a messy pile; a `uses' relationship may be expressed by placing the `user' on the left-hand side of the `used'). Marshall & Rogers [93] describe the Aquanet users' various manipulations of representational structure as crucial to the author's interpretive process, and a basis for subsequent writing activity even if the hypertext network and content are not reused.

One of the aims of the SEPIA system [137, 138] is to help avoid problems of premature organisation of unfinished or poorly understood ideas by offering authors more assistance than the mere organisation of linked nodes of text. It makes a similar but more extensive separation of domains than is seen in Kaindl's system. SEPIA provides an authoring environment based on the ideas of micro-, macro- and super-structures expanded upon in chapter 2. It provides a number of writing spaces in which different types of writing activity are performed, including a content space for building up semantic networks representing the domain knowledge (as well as notes, excerpts and whole authored texts), an argumentation space for generating, ordering and relating arguments about the knowledge, and a rhetorical space for organising and re-organising the global outline, issues, arguments and coherent sentences. SEPIA is more complete than LACE-92 in that it provides a separate activity for the authoring processes associated with each `level' of structure composing the text. LACE-92 only allows the author to manipulate the final `rhetorically and argumentatively complete' text, even while trying to manipulate the basic relationships within the knowledge domain.

3.3 Authored Structure vs Evolutionary Structure

According to Moulthrop, hypertexts must have local coherence within a grain [104]. These grains, or lexias, are reading units which can stand in isolation and are the atoms of a hypertext network [87]. In a hypertext environment such as Notecards or HyperTIES, the lexias are the individual cards or screens of information which are linked together to form the larger network. In a document-based environment such as LACE, the lexias may be considered to be the whole documents (their coherence is derived from the author's single presentation). However, we have seen that the documents are themselves complex structures whose subcomponents are linked not only to other internal components, but also to external lexias (via bibliographic citations). In fact, particular kinds of document (e.g. newspapers, academic journals, conference proceedings) may exist simply to provide a coherent shell for their diverse but individually coherent components. Other document forms (e.g. dictionaries and encyclopaedias) are also composed of individual components, but have a deliberately coherent style of information presentation with a high degree of internal cross-referencing. In these cases it may be sensible to refer to the components as lexias in their own right (the document would be classed as a composite network node according to [130]).

Stott & Furuta [135] classify hypertexts as browsable databases (hyperbases) or nonlinear documents (hyperdocuments) according to the method of constructing a hypertext from the lexias. A hyperbase is non-intentional (i.e. there is no overall presentation or argument), evolves over time and requires search and query techniques to augment link following to form a useful browsing strategy. Conversely, if the network browsing strategy is largely determined by the structures imposed by a co-ordinated act of authorship then it forms a hyperdocument. Note that this differentiation concerns the network, not the components: a large corpus of documents could form a hyperbase if there was no overall structure to the collection. The following subsections describe several hypertext environments which make use of novel structuring techniques.

3.3.1 HyperSet

HyperSet [112, 113] is a hyperbase model, in which the individual lexias are not connected by links. Instead, lexias are partitioned into sets according to perceived similarities. Navigation consists of moving from one set to another via objects in the intersection of the two sets. Authoring consists of creating new lexias and assigning them to various sets. This kind of hyperbase is recommended for tasks involving taxonomic reasoning which cannot easily be modelled using directed binary links.
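This set-based navigation can be sketched in a few lines of Python. The sketch is illustrative only: the function names, set names and lexias below are hypothetical, and HyperSet itself is not implemented this way.

```python
# A minimal sketch of the HyperSet model: lexias carry no links;
# each lexia is assigned to one or more named sets, and navigation
# moves between sets via lexias in their intersection.

def members(assignments, set_name):
    """All lexias assigned to the named set."""
    return {lexia for lexia, sets in assignments.items() if set_name in sets}

def bridges(assignments, from_set, to_set):
    """Lexias through which a reader can move from one set to the other."""
    return members(assignments, from_set) & members(assignments, to_set)

# Each lexia is tagged with the sets to which it belongs (invented data).
assignments = {
    "cell-division": {"biology", "introductory"},
    "mitosis":       {"biology", "advanced"},
    "levers":        {"mechanics", "introductory"},
}

# Moving from the 'biology' set to the 'introductory' set is possible
# via any lexia belonging to both sets.
print(bridges(assignments, "biology", "introductory"))  # {'cell-division'}
```

Authoring a new lexia amounts to adding an entry to the assignments table; no links need be created or maintained.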

3.3.2 StrathTutor

StrathTutor [96] was developed as a computer aided learning package in the late 1980s. It provided a navigation-by-exploration paradigm to its users, with no links or anchors explicitly stored with each node or displayed on it. Each frame (or node) of text or graphics was characterised by various attributes which were used to dynamically compute links to the next `most relevant' frame. The attributes were associated with particular regions of the node; the node's attributes were the union of the attributes of all its regions. The attributes were drawn from a group of several dozen constant labels chosen for each application. The relevance of two particular nodes was calculated according to the number of attributes which they had in common. The user, browsing a particular frame, was presented with a menu of `next frames', ordered by relevance score.

This system is very similar in concept to HyperSet: it also implements a hyperbase without any directed links. Its browsing strategy is similar, but instead of moving from `object to object' (as in normal hypertext) or `object to set to intersecting set to object', the navigation goes from `object to powerset to object', as shown below.

O is the set of all objects, S is the set of all sets, P is the powerset of all sets, and the set of sets to which o belongs is denoted so. We know that so ⊂ P and the number of elements of so is |so|. The set of sets to which both o and o' belong is given as so ∩ so', and has |so ∩ so'| items in it. A link notionally exists between o and o' if |so ∩ so'| > t (where t is some threshold value, possibly 0). The links are prioritised according to the value of |so ∩ so'|, so that the object with the largest number of common sets is given as first choice for the user's next destination.

For example, if

S = {intro, intermediate, advanced, biology, medicine, mechanics, examples, text},

then

P = {{}, {intro}, {intermediate}, ... {text}, {intro, intermediate}, ... {medicine, mechanics}, {intro, intermediate, advanced}, ... {biology, medicine, mechanics}, ...}

and

s1 = {intro, medicine, text}           |s1 ∩ s2| = 1
s2 = {intermediate, mechanics, text}   |s1 ∩ s3| = 1
s3 = {intro, biology, examples}        |s1 ∩ s4| = 2
s4 = {intro, medicine, examples}

and so the most important destination from object 1 would be object 4.
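The ranking of candidate destinations can be sketched in Python. The frame names and attribute sets follow the worked example above; the function name and threshold parameter are illustrative inventions, and StrathTutor itself was not implemented this way.

```python
# Each frame is characterised by a set of attributes; candidate
# destinations are ranked by the size of the attribute overlap with
# the current frame, subject to a threshold t.

def ranked_destinations(frames, current, t=0):
    """Frames sharing more than t attributes with `current`, best first."""
    here = frames[current]
    scored = [(len(here & attrs), name)
              for name, attrs in frames.items() if name != current]
    return [name for score, name in sorted(scored, reverse=True) if score > t]

frames = {
    "s1": {"intro", "medicine", "text"},
    "s2": {"intermediate", "mechanics", "text"},
    "s3": {"intro", "biology", "examples"},
    "s4": {"intro", "medicine", "examples"},
}

# From frame s1, the frame with the largest attribute overlap (s4,
# sharing 'intro' and 'medicine') heads the menu of `next frames'.
print(ranked_destinations(frames, "s1"))
```

Raising the threshold t prunes weakly related frames from the menu, which is how such a system can trade breadth of choice for relevance.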

3.3.3 Theseus

Theseus, an educational hypertext project at Liverpool John Moores University, UK [67, 68, 69, 70], is a hypertext model incorporating both extremes of the Stott and Furuta classification. It is examined in some detail here because of its dual hyperbase/hyperdocument philosophy.

Theseus is neither a hypertext system nor, strictly, a hypertext model, since it addresses few issues of hypertext technology, the implementation of hypertext mechanisms or their behavioural properties. Instead it is a model for authoring `hypertexts' using currently available technology and systems; the Theseus Project has produced one commercial `hypertext', on the subject of cytology, using SuperCard and QuickTime in the Macintosh environment [68]. Theseus hypertext organisation consists of two quite distinct conceptual layers: mediabases and subject paths [70].

A mediabase consists of a number of objects (or nodes). The objects are of different kinds and probably different media. They may be simple text strings, whole documents or complete executable applications. The only distinctive requirement for an object is that it must be complete, and able to be used without reference to other objects. This completeness is intended both in the sense of hypertext (objects do not contain links to other objects) and in the sense of meaning (each object should be a self-supporting, standalone statement, amenable to understanding without recourse to other objects). Obviously it is impossible to make an object completely standalone, without reference to any outside information; rather the aim is to have the mediabase populated with `objective' statements which can be `discussed' intelligently on their own standing.

A subject path (also known as a thesis) is a linear sequence which can contain references to objects in the mediabase. Two or more subject paths intersect when they make reference to the same mediabase object. A subject path may mix a lot of (multimedia) information in with the object references, or it may use the object references by themselves. The subject path may be retraced either a step at a time, or in one jump to the start of the path (thus the analogy of the legend of Theseus). The user maintains a personal index of possible future paths in another subject path. The set of subject paths makes up the subject layer.
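The two-layer organisation can be sketched as follows. The data structures and example content are hypothetical, chosen only to illustrate the model, not taken from the Theseus implementation.

```python
# A sketch of the Theseus organisation: a mediabase of standalone
# objects, and subject paths (theses) which are linear sequences of
# references into it.

mediabase = {
    "obj1": "An objective statement about cell structure.",
    "obj2": "A micrograph of a dividing cell.",
    "obj3": "A glossary entry for 'mitosis'.",
}

# Each thesis is an ordered sequence of mediabase references; it may
# also interleave its own multimedia material (omitted here).
subject_paths = {
    "thesis-A": ["obj1", "obj2"],
    "thesis-B": ["obj2", "obj3"],
}

def intersections(paths, a, b):
    """Mediabase objects at which two subject paths intersect."""
    return set(paths[a]) & set(paths[b])

# thesis-A and thesis-B are braided together through the object they
# both reference.
print(intersections(subject_paths, "thesis-A", "thesis-B"))  # {'obj2'}
```

Note that the paths refer only downwards into the mediabase; nothing in the structure lets one path refer to another, which is the limitation discussed below.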

A Theseus `author' is a person who has something to express about a topic. These personal viewpoints are related to a wider frame of reference: a public database of archived materials (the mediabase). Each personal viewpoint becomes braided to other personal viewpoints via the objects it points to in the mediabase. Each viewpoint is formulated according to the goals, tastes and understanding of its author. Each objective statement in the mediabase acts as a focus for different arguments within the set of personal viewpoints. A mediabase object is given significance by an author's use of it; and that use `grounds' both the object and the thesis in a network of associations.

Moulthrop proposes hypertext as a deconstructive medium which should be used to restructure and relink texts, and in which the meaning of a text is found from its relationship to other texts [102]. This is seen in Theseus, where the meaning of a text (or object) is defined in terms of the subject paths that include it. The hypermedium is seen to evolve from the intersection of subject paths, not the intersection or multiplicity of objects, in fact the function of an object is as a site for these new intersections. The mediabase is a database of objects: somewhat like the Furuta hyperbase without connecting links. The subject paths on the other hand are hyperdocuments that apply intention and connection to the hyperbase.

One of the problems of Theseus is that it does not support reflexive information linking (a subject path cannot refer to other subject paths). The consequence of this is that every thesis made about a group of objects stands as written and cannot be annotated, commented on, criticised or supported. In fact, a thesis cannot even be referred to except as an indirect consequence of referring to its objects. This is a serious consequence of the model, but it receives no mention in the literature beyond the statement that the subjective understanding displayed in any thesis is equally valid.

Another potential difficulty in the Theseus model is that (by implication) when referring to an object, the complete set of subject paths that also refer to that object are visible. Here the analogy with the legend of Theseus breaks down since the user may be presented with dozens or hundreds of alternative loci, not simply the two or three that are seen in a maze. The lack of reflexive linking prohibits managing this situation by providing summary or partitioning structures within the `hypermedium' to constrain and direct user browsing.

3.3.4 LACE

LACE, Grif and DynaText provide a mechanism for expressing local coherence, and a means of representing complex structured lexias. They are in one sense hyperdocument models, but do not necessarily provide a facility for extending Moulthrop's idea of coherence beyond the document-specific sub-lexias. As such they provide for a hyperbase of hyperdocuments. LACE-92 extends LACE by providing a mechanism for creating locally coherent, complex, structured lexias from information in a global hyperbase. In the following chapters we look at the challenges of providing a genuinely global hyperbase, at the systems and standards that can be used to good advantage in such an environment, and at a new document architecture (LACE-93) which can be used to extend coherence beyond the local document level and so provide global hyperdocument facilities.

4. HYPERTEXT IN THE LARGE

"Hypertext in the large" is hypertext concerned with a global context--not just a single text or group of intimately related texts, but hypertext as a universal literature resource. Previous chapters have looked at hypertext representations of individual documents or local document collections within a controlled environment and context, but problems of scalability and maintainability mean that the same systems and mechanisms may not be suitable for very large or highly distributed corpora.

Keeping the node and links model intact but extending the node addressing scheme to allow remote nodes and defining a node transport mechanism allows a hypertext system to be extended to operate across a network. This is the approach taken by the World-Wide Web project.

Alternatively, the node and links model may be abandoned, and large-scale textual connections may be achieved by "on-the-fly" machine searches implemented in text archives and document retrieval systems. This is the approach taken by the WAIS project.

The approaches of text-retrieval and hypertext links are compared and contrasted and a compromise, applying loose links to a flexible domain of documents, is discussed. This is the approach of the Microcosm project.

The effect of these three hypertext paradigms on hypertext production and maintenance is discussed, and the use of Microcosm loose links within the World-Wide Web is discussed.

4.1 World-Wide Web

World-Wide Web (WWW) is a hypertext system which keeps the node and links model intact but provides distributed hypertext by extending the node addressing mechanism and defining a node transport mechanism [12]. WWW's appeal is that it defines a basic document architecture capable of simple visual interpretation with support for embedded graphical objects.

The Web is based on a client-server model, in which a standard transfer protocol (HTTP, or HyperText Transfer Protocol [14]) is used to communicate hypertext documents in a standard format (HTML, or HyperText Markup Language [13]). The client's job is simply to interpret the document's markup and provide a visual rendering of the document, to maintain a history list of the user's recent session, and to enter a dialogue with a remote server to obtain the destination document when the user activates a link. The URL (Universal Resource Locator [15]) used to specify the document which serves as the link destination is composed of four parts (see figure 4.1). It is the client's job to parse the first part in order to use the correct retrieval protocol (usually HTTP) and the second part to establish a connection to the correct server host. It is the job of the server to interpret the third part (the path) to produce a document. The client displays the document returned by the host, and jumps to the part of the document labelled by the (optional) fourth part of the URL.

protocol://host/path#name

http://bright.ecs.soton.ac.uk/ResearchJournal/paper1.html

file://bright.ecs.soton.ac.uk/pub/papers/im/mcm.ps

Figure 4.1: Universal Resource Locator Definition and Examples

The path part of the URL is cast in terms of a path in a hierarchical name space and by default this path is interpreted as a file name relative to the server's `home directory'. If the path represents a directory, not a file, the server may return a directory listing with the names of the files as links which, when activated, return the actual files themselves. Similarly, a part of the name space can be used to specify a program to be run and the arguments to be passed to it, so that the user is returned a document `composed on the fly'. The URL's path component can also be interpreted as representing a document and a query string that must be matched against that document. The server may even effect a gateway to another information service (such as WAIS or Gopher). The client is ignorant of the different alternatives: it simply uses a URL as an address for retrieving a piece of information.
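The division of labour over the four URL parts can be illustrated with Python's standard urllib.parse module (an anachronism relative to this thesis, used purely for illustration; the fragment label `abstract' appended to the example URL is invented).

```python
# Splitting a URL of the form protocol://host/path#name into the four
# parts described in figure 4.1, and noting which party interprets each.

from urllib.parse import urlsplit

url = "http://bright.ecs.soton.ac.uk/ResearchJournal/paper1.html#abstract"
parts = urlsplit(url)

print(parts.scheme)    # retrieval protocol the client must speak: 'http'
print(parts.netloc)    # server host the client connects to
print(parts.path)      # interpreted by the server to produce a document
print(parts.fragment)  # optional label the client jumps to: 'abstract'
```

Only the path ever reaches the server, which is why the server is free to interpret it as a file name, a program invocation or a gateway request without the client's knowledge.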

4.1.1 Document Structure

HTML provides a simple document architecture which consists of several levels of heading, various lists, paragraphs, different kinds of emphasised text and hypertext links (see figure 4.2). HTML can be expressed in SGML, and an HTML DTD is available. However, since there are no HTML programs (client or server) which actually make use of an SGML parser, the majority of HTML documents used on the Web do not conform to the DTD, mainly because of petty details such as the inclusion of formatted text in link anchors. Because of this, there are various versions of the HTML DTD which correspond to various levels of `strictness' in HTML conformance.

HTML is undergoing various revisions: the current common form is known as HTML+, which is basic HTML with added tags for defining forms, or in-document dialog boxes. Various other document structures such as abstracts are also being added to the basic HTML repertoire.

<head><title>Example WWW Document</title></head>
<body>
<H1>Important Information</H1>
This is an example WWW document which contains a
<a href="http://site.edu/docs/mydoc.html">link</a>
to another document.
<p>
The word &quot;link&quot; in the previous paragraph is
an anchor which would appear highlighted on the
display. If the user were to click on the anchor,
the document <b>docs/mydoc.html</b> would be retrieved
from the computer called <i>site.edu</i> on the Internet.
</body>

Figure 4.2: HTML Example

It is interesting to consider the structuring options available to the Web author. By taking advantage of the HTML architecture a text can be written as a single, coherent document, with internal cross-references. Alternatively, the text can be split into many nodes and the text's structure inferred from the links between the nodes (HTML does provide a link attribute to explicitly code the relationship between the source and destination of the links, but it is largely unused by the browsing software).

One of the factors that influences this choice is whether HTML is the original authoring environment for this particular text--a translation from another textual or hypertextual environment may dictate the preferred structuring paradigm. Information from a card-based system like HyperCard may be naturally chunked, whereas a Word document may remain as a single, complex entity. The latex2html program which is used to convert LATEX documents into HTML gives the author freedom to specify the degree of chunking--whether new nodes are to be started for each subsection, or section or chapter.

4.1.2 Network Navigation

Dynamic navigation of the Web by a user is undertaken in one of two ways:

* from any given document the user selects a linked document by clicking on an anchor, so that a path of nodes is traversed in order to reach the intended destination

* the user can `jump' to the exact document required by specifying the known address (URL) of the document.

The former mechanism requires the user to follow semantic cues in the contents of the documents in order to repeatedly choose the correct links to follow. The latter requires the user to make use of an already-known address, which may come from:

* a compiled-in list of well-known documents provided with the Web client viewer software

* a short hotlist of remembered addresses of previously visited documents which were deliberately noted by the user

* a comprehensive list of every document seen by the Web viewing software

* outside the Web environment: new sites advertising their URLs on other electronic services (mailing lists and network news), or word of mouth from colleagues

i.e. apart from link following, it is only possible to navigate to a document if you have already been there, or if you are provided with a handle to it by its author or by someone else who has been there. This is then a `pure' link-following environment, without recourse to text searches or comprehensive document catalogues; it is almost impossible to navigate the Web with the aim of finding all documents about a particular topic.

Users of the Web will be aware that documents tend to fall into two broad categories: content-bearing documents about a particular topic and catalogue documents which contain no subject-domain information themselves, but do contain many links to other information sources. These other sources may themselves be content-full documents, or may be further catalogue documents. The significant point is that it is the catalogue documents which contain pointers to external resources and not the content documents. The content documents usually contain pointers within their local document structure (especially when a single document is implemented as a tree of nodes), but few if any pointers to other relevant works.

4.1.3 Document and Network Analysis

By writing an automated Web browser which negotiates the HTTP protocol and analyses the HTML markup it is possible to get some information about the use of structure within the Web. This structure may be the localised, intentional structure of a single document, or non-intentional structure which becomes apparent in the patterns of authorship and administration of the global Web.

Unlike small-scale hypertext systems, it is not possible to enumerate the nodes participating in the network before visiting them. Neither is it possible to enumerate the links in the network since they are stored inside the nodes. Effectively the network unfolds through exploration: a starting point is required, from which one obtains a set of linked nodes, and from each of these further links are discovered. The process of obtaining a single node from the network takes a certain amount of time: each stage in the process (establishing a network connection, starting a remote HTTP server, extracting the node from a disk file, and transmitting it across the network) may take hundreds of milliseconds. Experience shows that several seconds are required to retrieve a typical node, given a lightly loaded server and a quiet network. Visiting every node in the Web is likely to take many weeks (at the time of writing) during which time the Web will be changing--it is impossible to take a completely static snapshot of the network. Even if time were no obstacle, there is no guarantee that every document which is part of the Web could be reached from an arbitrary node.

Although these constraints do not allow us to gain a complete picture of the Web, we can be confident that it forms a hierarchy to a first approximation. This is because the URL space is composed of a hierarchical Internet site name space combined with a hierarchical path name space. Beyond this, each document is contained by HTML's hierarchical document architecture.

Given the constraints on perceiving the Web outlined above, some work was undertaken by the author to analyse the structures of the Web and the patterns of authorship seen. First of all a simple Web client called wwwgrab was written which takes a URL as an argument and retrieves the node addressed by that URL. Then a UNIX shell script (hyperfind) was written which takes a URL, invokes wwwgrab to fetch the document, analyses it and then recursively invokes hyperfind on all the URLs given as link destinations (this is an example of a WWW browsing program known as a spider). Different versions of hyperfind were tried: some limited their explorations to a particular site or Internet domain in order to gain a detailed and in-depth picture of a localised region of the Web, while others followed out-of-site links by preference in order to gain a large degree of coverage of the total Web, instead of becoming bogged down in a particular site's document archive.
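The recursive behaviour of the wwwgrab/hyperfind combination can be sketched in Python. To keep the sketch self-contained and runnable, the network fetch performed by wwwgrab is replaced by a lookup in an in-memory table of pages; the URLs and page contents below are invented, and a real spider would of course retrieve each URL over HTTP.

```python
# A minimal spider: fetch a node, extract the URLs given as link
# destinations, and recurse on each, keeping a set of visited nodes
# so that cycles in the Web do not cause infinite recursion.

import re

HREF = re.compile(r'href="([^"]+)"')

def spider(fetch, start, seen=None):
    """Visit `start`, then recursively visit every link destination."""
    if seen is None:
        seen = set()
    if start in seen:
        return seen
    seen.add(start)
    for url in HREF.findall(fetch(start)):
        spider(fetch, url, seen)
    return seen

# A three-node toy Web standing in for the real network.
pages = {
    "http://a/": '<a href="http://b/">b</a> <a href="http://c/">c</a>',
    "http://b/": '<a href="http://a/">back</a>',
    "http://c/": "a dead end with no links",
}

print(sorted(spider(pages.get, "http://a/")))
```

The `seen' set is essential: without it the a/b cycle in the toy Web above would never terminate, and the real Web is full of such cycles.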

The hyperfind script was run on the URL of a known WWW catalogue (the WWW sites list maintained by the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign) and on the URLs of several major Web sites (CERN in Switzerland and the JNT in the UK). This exercise produced a list of some 12,000 nodes at 600 sites over a weekend. In order to obtain some understanding of the structure of the Web beyond the first approximation of a hierarchy (above), a number of these sites were displayed graphically. A hierarchical visualisation of the results for a typical site is displayed in figure 4.3 (the data were extensively pruned to show only the relevant parts of the Web within this single site).

Examining this figure we see that there are about a dozen links from the home page: to information about the Web project itself, general information about the city, the department and departmental events, specific information about the research groups within the department, private information for departmental personnel, pointers to other information providers within the University, and meta-information about the Web and other Internet services. The part of the Web relating to departmental research groups has been expanded to show six research groups, of which the Formal Methods group has been focussed on. The Formal Methods group contains links to each of the eight academic personnel who compose the group, as well as to a number of the projects being undertaken. Following the link to one of the academics shows a number of entries: information about a journal, abstracts of a couple of papers, lists of text books, a link to a (Gopher) mail archive and an FTP-able standard definition.

It seems to be highly significant that, having reached the bottom of the hierarchy at this point, where one would expect a significant amount of data to reside, the Web documents are either information about entities external to the Web (journals, textbooks, academic activities) or references to data which is held outside the Web proper (mediated by Gopher, FTP or even "snail mail"). In particular, the two papers which are mentioned are not available in HTML: they must either be downloaded from an FTP archive in compressed PostScript form (and typically printed), or else requested by paper post.

What these initial results seem to show is that currently most content-full documents are actually not Web-native (stored in HTML format and mediated by HTTP) but FTP-native (stored in a possibly compressed PostScript format, mediated by FTP), and that the Web is used to provide an accessible route to these documents. The hypertext features of the Web actually implement a user-oriented navigation structure placed on top of the more primitive FTP archive or hierarchical file system; that navigation structure is based on a familiar `prospectus' metaphor (this is our organisation; here are the departments; here are the people who work here, their CVs and pointers to their papers and documents describing their research/commercial activities).

4.1.4 Drawbacks

Despite its distributed, multi-site nature, the Web provides no tools for collaborative authoring. In fact, apart from simple document structuring, there are no tools for authoring at all. Beyond a view of the component lexias it is not possible to manipulate the Web's network, or even view it graphically. In effect WWW is a presentation system for distributed heterogeneous information sources which includes a simple hypertext navigation metaphor.


Figure 4.3: A snapshot of the Web at a typical site

One of the drawbacks of its simple embedded links model is that it does not allow easy hyper-document maintenance. Since the links are cast in terms of a URL which gives explicit document location information, any change to the organisation of the destination site will invalidate the links. This is not an uncommon occurrence--the above analysis seems to indicate that about 3% of links are inoperative.

Another problem of the simple embedded links model is that of dead ends: only HTML documents and images can contain links. Although links can point at other kinds of documents which the client will arrange to have displayed by the appropriate native viewer, these documents cannot contribute to the Web--they are dead ends with no link following possible.

The embedded links model gives another problem to authors: how can a document be constructed so that it contains explicit embedded references to all the data destinations that are required? It is not uncommon to find Web documents in which English discourse has been replaced with a list of "click here to see X" phrases.

The problem of topic-based navigation of the Web is similar to the problem of finding a file on a particular subject among the Internet's anonymous FTP services. In that environment, enthusiastic volunteers at first published regular lists of sites and the kinds of files at each site. Some sites also used to provide a file containing a complete list of all the files available from their machine. Eventually a single site provided a database of the names of files available at all of the well-known anonymous FTP sites; an interactive query service (known as archie) allowed any user to find out where a file was archived, given a fragment from that file's name. This service has now been replicated at several dozen sites across the whole Internet, so that any user can obtain a list of potentially relevant files as long as the name of the file is indicative of its contents. A similar system could be applied to the Web; already software is available to allow an administrator to automatically catalogue each of the Web server's files.
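The archie-style lookup amounts to a substring match against a database mapping sites to their file lists, which can be sketched as follows (the sites and filenames below are invented; the real service built its database by polling well-known anonymous FTP sites).

```python
# Given a fragment of a filename, report the sites holding a matching
# file, in a stable (sorted) order.

def where_is(database, fragment):
    """Sites holding a file whose name contains the given fragment."""
    return sorted(site for site, files in database.items()
                  if any(fragment in f for f in files))

database = {
    "ftp.site-a.edu": ["mcm.ps", "readme.txt"],
    "ftp.site-b.ac.uk": ["paper1.ps", "mcm.ps.Z"],
}

print(where_is(database, "mcm"))  # ['ftp.site-a.edu', 'ftp.site-b.ac.uk']
```

As the text notes, the scheme works only insofar as the name of a file is indicative of its contents; it says nothing about topics that the filename does not mention.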

A number of informal attempts are being made to provide similar services for the Web. Several programs like hyperfind (the generic term for a program which travels the Web is a spider or robot) have been used to create databases of URLs and document titles; a single URL is provided which gives a fill-out form to indicate the keywords which interest the user. This is matched against the database, and a list of clickable document titles is returned.

The problem with spiders is that they are too intrusive and take too long to run. One of the best-known spider databases is currently five months out of date. An alternative approach is to provide a shared database which users can voluntarily populate with information about their sites. This has been provided by the so-called Virtual Library project [97], which advertises a single URL corresponding to a fill-out form which can be used to register documents according to an evolving classification system. The main drawback with this approach is that it is a voluntary scheme, relying on authors providing information about their documents, and as such is not particularly well used.

The conceptual problem with all of these navigation services is the same as the general Web navigation problem: in order to use these services a user has to know that they exist. In order to discover their existence they must probably read about them from an alternative (broadcast or multicast) information source.

4.2 WAIS

The approach of the WWW project is to maintain a simple nodes and links model to allow distributed access to hypertexts. As we have seen above, this has not yet led to an increase in the connectivity of individual documents to the reader's advantage.

An alternative approach to large-scale `hypertext', as seen in the Wide Area Information Server (WAIS) products from Thinking Machines, is to do away with fixed links, and to maintain instead a distributed registry of nodes and their attributes [20]. For the user of such a hypertext environment, link following is exchanged for "on-the-fly" database searches in node registries. In the case of WAIS, the attribute stored in the registry for each node is a complete inverted index of the node's contents; link following is supplanted by choosing documents which are considered relevant to the current node. This model is similar to that of StrathTutor, except that the node attributes are not assigned explicitly by an author but are inherent in the node's text. The relevance rating of each possible destination document is calculated according to the similarity of terms in the text of the potential destination and the current node. (It is possible to have the relevance measure performed on a subpart of the current node to focus the reader's interest.)
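The relevance matching described here can be sketched as a term-overlap score between the current node's text and each registered document. This is a deliberate simplification of WAIS's weighted inverted-index scoring, and the corpus is invented for illustration:

```python
# Sketch of relevance ranking by shared terms, in the spirit of the
# WAIS model described above. The real system uses a full inverted
# index and weighted scores; this toy version counts common words.
def terms(text):
    return set(text.lower().split())

def rank(query, documents):
    """Return document titles ordered by number of terms shared with query."""
    q = terms(query)
    scored = [(len(q & terms(body)), title) for title, body in documents.items()]
    return [title for score, title in sorted(scored, reverse=True) if score > 0]

docs = {  # invented corpus
    "Doc A": "hypertext links and nodes",
    "Doc B": "poisson distribution of words",
    "Doc C": "structured hypertext documents and links",
}
print(rank("hypertext links", docs))
```

Selecting a passage of the current node and submitting it as the query gives the pseudo-link-following behaviour discussed below.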

WAIS does not provide many of the facilities of a hypertext system since it is really an information retrieval environment. Typical GUI-based WAIS clients provide separate fields for reading documents and typing `queries', although minor modification to the front end would provide pseudo-link following by allowing queries to be expressed by selecting text from the document.

WAIS is a networked client/server system, in which the client document viewer sends a piece of text to the server. It is the server's job to analyse the text and to score the relevance of documents it has registered. It then returns a shortlist of document titles and scores to the client, which allows the user to make a choice. There are many servers available on the network which serve different document resources (most of which are subject based), but there is a chicken-and-egg problem here: in order to obtain a selection of relevant documents from a large set of documents the user must first choose a relevant server from a list of several hundred possible servers. One way of doing this is to send a proto-request to the (so-called) server-of-servers, containing a list of keywords which identify the subject area that the required server must be registered for.

Producing a document is trivial with this system: there are no links to add, nor any particular document format to attend to (WAIS is mainly used for simple textual documents). Any document (or selection from it) can be used as a source for triggering links (or relevance matches). In order to be considered as a destination for a match, a document must be submitted to a server and indexed.

The advantage of this approach is that any document can be instantly linked into the server's corpus by virtue of its textual similarities. Another advantage is that it is resilient to document addressing or organisational changes. All `links' are generated by the server and not stored by the clients. If a document is altered or removed from the server's collection then the server need only perform a re-indexing operation to maintain consistency.

This model of hypertext navigation is also seen in Salton's work [127] where a linked structure is superimposed over texts and based on the textual similarities of the texts. In Salton's work the links are explicit, whereas in WAIS they are implicit; in both cases they are automatically generated from the text contents.

4.2.1 Textual Linking and Automated Linking Experiments in Lace

In describing WAIS we have seen the replacement of links by dynamic text matching operations. It is possible to consider such operations, as in Salton's work, as defining rather than replacing links for a more traditional hypertext system. The previous chapter's description of LACE showed how it allowed hypertext browsing based on a document's explicit structural information. Here we describe previous attempts to extend LACE to allow hypertext browsing based on a document's implicit semantic information, as derived by text analysis techniques. The aim of this is to provide better `automatic' hypertext generation. These searches may be done interactively, or as a batch process to predetermine applicable links.

Hypertext links may be authored explicitly between two items in a document corpus; however, this activity requires both creative, intellectual effort from a human author in setting up the links and non-creative, intellectual effort to maintain consistency among the links as the document corpus evolves. The need to minimise the load on the author and maintainer of such an interlinked document corpus has led to various efforts at automating the linking process, leaving the responsibility with the computer to generate and maintain the hypertext. This has important knock-on effects, as it allows an already existing literature to be integrated into a hypertext environment with (in theory) minimum effort.

In a hypertext environment which requires the author to create links manually the author has to choose both the source and destination of the link and then to invoke a linking operation. This may be done in a visual fashion by directly manipulating the contents of the document items to be linked, or else indirectly by naming the ends of the link. The main concern of manual linking is to correctly identify the addresses of a link's endpoints within a set of documents.

Automatic linking relies on a computer being able to derive suitable places within the document corpus to act as link endpoints. This may be done as a batch task to compile a "definitive" set of links between all suitable pairs of document items within the corpus, or else as an interactive search for all suitable destination endpoints from a specific document item selected by a reader. Whichever of the two methods is chosen, the identification of suitable endpoints is performed within various domains: syntactic, lexical or semantic.

It is possible to create hypertext links by identifying various superficial textual features common to various types of written information. Technical writing may employ phrases such as "see table 3" to indicate an internal cross-reference, or indicate an external citation by adding it as a parenthetical comment after the information to which it refers: "through the use of SGML (Barron 1990)". Different publication bodies may vary the exact representation of these "link anchors", but all will aim for consistency so that the reader may easily understand what is being indicated. This has an obvious advantage when a computer is brought to bear on the text, since the links may be recognised by the use of a simple set of regular expressions without any attempt to analyse the meaning of the character strings.

cross_reference ::= see (([Ff]igure)|([Tt]able)) [0-9]+(\.[0-9]+)?
citation ::= \([A-Z][a-z]* 19[0-9][0-9]\)

Figure 4.4: egrep-style regular expressions for two types of link source.

Not every style of writing uses explicit references as above, so syntactic analysis is not a universally applicable technique. However, most documents available in computer-readable form are of a technical nature, so it affords a useful first attempt.
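The patterns of Figure 4.4 translate directly into a scanning program. A minimal sketch using Python's re module, with the alternation grouped so that `see' applies to both the figure and table alternatives (the sample sentence is invented):

```python
import re

# Egrep-style patterns for the two kinds of link source: internal
# cross-references ("see table 3.1") and parenthetical citations
# ("(Barron 1990)").
cross_reference = re.compile(r"see (([Ff]igure)|([Tt]able)) [0-9]+(\.[0-9]+)?")
citation = re.compile(r"\([A-Z][a-z]* 19[0-9][0-9]\)")

text = ("Structured markup, see table 3.1, is discussed "
        "through the use of SGML (Barron 1990).")

print([m.group(0) for m in cross_reference.finditer(text)])
print([m.group(0) for m in citation.finditer(text)])
```

Each match identifies a link source; resolving its destination is the separate problem discussed next.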

Once the source of a link has been discovered, it is necessary to identify the link's destination. Given a reference to an internal item, it is easy to find that item from its heading or caption. However, citing a reference to an external document (external to the document, not to the corpus) implies a global naming scheme which would be too cumbersome to quote in the body of the text. Usually citations are a key to be looked up in a list of references at the end of the text. It is this bibliography which provides everything necessary to obtain the destination (given sufficient motivation).

We have seen the necessity for modern hypertext systems to be able to use existing sources of documentation if they are to be useful in a practical way. LACE extended the notion of reusability by allowing the automatic creation of links from the original (flat) document. However, LACE only dealt with links created automatically as a side-effect of a document's logical markup, i.e. based on the structure of the document; there is much more information in the content of the document than in the markup. This section outlines an initial study of the literature, together with the results of some preliminary experiments to extend LACE, aimed at finding ways of creating automatic links based on the information-rich document content as well as the explicit document markup.

The literature deals with two approaches to document content: classification and indexing. The approach of the former is to relate the knowledge contained in a document to the universe of knowledge, specifying how the document is both similar and dissimilar to other documents. The latter highlights useful concepts contained in a document so that a reader may gain direct access to pertinent information. The process of indexing is analogous to that of creating links for a hypertext document: link creation makes specific references from one document to another, while indexing creates a table of `half-links' which are only resolved when the literature is being read.

Obviously classification and indexing are not independent, since the choice of the set of index terms will depend upon the subject to which the document addresses itself. Classification is difficult because it requires the ability to model the information contained in the document, whereas indexing requires only the ability to identify the important words and phrases which are associated with its concepts.

It is hoped that there is a method for extracting a set of index terms from a document or set of documents which is efficient (it should not require undue computing resources), general (it should work on literature from any field) and accurate (it should not produce insignificant terms nor miss out important ones). Given a set of index terms for each document in a hypertext system, any reference to such a term in one document would be linked to all other documents which also refer to it.

4.2.1.1 Simple Frequency-Based Index Model

The basic thesis of automatic text analysis is that the frequency of a word's occurrence is a measure of its significance. Work done by Luhn [89] demonstrated that if a text is examined and a graph is plotted of the frequency of each word's occurrence against its position in rank order then a hyperbolic curve is obtained. The graph is notionally divided into three sections:

1 the area to the left of an upper cut-off point which is populated by very common words

2 the area to the right of a lower cut-off point which is populated by very rare words

3 the area in the middle which contains words which make up the significant content of the document

It is assumed that overly-common words are insignificant (the, to, of...) and that overly rare words do not contribute to the content of the document. The resolving power of words (the ability to discriminate content) is supposed to reach a peak between the two limits and fall off rapidly towards those limits. However, the position of the cut-off points can only be determined by trial and error, so an alternative approach is generally used, which is to filter the list of words through a set of `fluff' or `stop' words--those words which are known to be `contentless'. After this, the resulting words are stripped of their suffixes to match the equivalent stems.
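The stop-word filtering step can be sketched as follows (the stop-word list is abbreviated and the sample text invented; a full implementation would also strip suffixes to merge equivalent stems):

```python
from collections import Counter

STOP_WORDS = {"the", "to", "of", "a", "and", "is", "in"}  # abbreviated list

def significant_words(text):
    """Rank non-stop-words by frequency, per Luhn's middle region."""
    words = [w for w in text.lower().split() if w.isalpha()]
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common()

sample = ("the ambulance crews joined the dispute and the unions "
          "said the ambulance dispute is the fault of the department")
print(significant_words(sample))
```

Filtering replaces the problematic upper cut-off point with an explicit word list, leaving frequency alone to rank the remaining candidates.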

4.2.1.2 Probabilistic Index Model

An alternative approach by Bookstein, Swanson & Harter [16] is based on statistical observations about the distribution of `content-bearing' (or speciality) words and `non-content-bearing' (or function) words in a text. They showed that the distribution of a function word w over a set of texts is modelled by a Poisson distribution, i.e. for a given function word, the probability f(n) that it will appear n times in a text is given by f(n) = (x^n e^-x) / n!, where x is interpreted as the mean number of occurrences of the word over the set of texts. Hence any word that does not follow this distribution can be assumed to be a speciality word, and used for indexing.

4.2.1.3 Co-occurrence Index Model

Curtice & Jones [45] observed that words which occur freely in any text environment are less suited to serve as index terms than those whose environment is detectably constrained. For each word i in a text, measuring f_i, the frequency with which it occurs, and N_i, the number of words with which it co-occurs, allows us to create a new statistic R_i = N_i / f_i. This is a measure of a word's promiscuity, and plotting it against N_i yields a scattergram which allows us to determine the words which tend to be associated with fewer terms and hence are more discriminating and (hopefully) more interesting.

Stone & Rubinoff [125] use co-occurrence statistics to distinguish between core terms which occur in all documents throughout a given field and particular terms which discriminate subfields within that field. First an indexing vocabulary is obtained from the complete set of documents: this represents the kernel set. This set is then expanded by taking each discarded word and computing its `association' with each kernel term. If the sum of these associations is greater than some threshold then the word is added to the list of particular terms. This step is repeated several times with successively higher thresholds to find those discarded words which associate strongly with the core and particular terms.
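The expansion step can be sketched as follows, using sentence-level co-occurrence as a toy association measure (the corpus, kernel set and threshold are all invented for illustration):

```python
# Sketch of the Stone & Rubinoff expansion: a discarded word is
# promoted to a particular term if the sum of its associations with
# the kernel terms meets a threshold. Association here is simple
# sentence-level co-occurrence counting.
sentences = [
    {"hypertext", "links", "nodes"},
    {"hypertext", "anchors", "links"},
    {"poisson", "words"},
]
kernel = {"hypertext", "links"}

def association(word, term):
    return sum(1 for s in sentences if word in s and term in s)

def particular_terms(threshold):
    discarded = {w for s in sentences for w in s} - kernel
    return {w for w in discarded
            if sum(association(w, t) for t in kernel) >= threshold}

print(particular_terms(2))
```

Repeating the step with successively higher thresholds, as Stone & Rubinoff do, would progressively narrow the set of particular terms.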

4.2.1.4 Sources of Index Terms

Caras [33] concludes that although index terms can be gleaned from the title of a document, it is better to use the abstract as a source for terms.

4.2.1.5 Use of Index Terms

Silverman & Halbert [133] define two measures of a document:

sophistication: how different the document's pattern of index terms is compared to the pattern of index terms in the universe of documents to which it is related

pertinence: how the pattern of index terms compares to the pattern of terms in a user's query

They maintain that, given a query and a set of documents that in some way satisfy the query, the ideal starting point is a document which in its own field contains common concepts, i.e. which is unsophisticated. Taking all the documents which match the query, one ranks their (sophistication, pertinence) indexes and starts in the (low sophistication, high pertinence) quadrant, working through to the (high sophistication, low pertinence) quadrant as the user expresses interest.
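This presentation order can be sketched as a sort ascending in sophistication and descending in pertinence (the scores are invented for illustration):

```python
# Order matching documents from (low sophistication, high pertinence)
# towards (high sophistication, low pertinence), as suggested by
# Silverman & Halbert. The scores are invented.
matches = [
    ("Doc A", 0.8, 0.9),  # (title, sophistication, pertinence)
    ("Doc B", 0.2, 0.7),
    ("Doc C", 0.2, 0.9),
]

ordered = sorted(matches, key=lambda d: (d[1], -d[2]))
print([title for title, s, p in ordered])
```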

Nelson [107] describes how IBM required their researchers to fill out a profile, detailing their research interests in their own technical language. This profile was used to match against the descriptions of new papers and books that IBM processed on a weekly basis. The technique has obvious extensions for browsing in a hypertext system, by allowing the user to create a typical document detailing their research interests. This document can be evaluated by various of the above techniques to produce a `filter' for general queries (in fact, this is the method used by many WAIS front ends, as described in section 4.2).

4.2.1.6 Index Exhaustivity & Specificity

Zunde [149] draws the distinction between an external index, whose target is a single document, and an internal index, whose target is a specific piece of information within a single document. (Within LACE there is no distinction between the two operations: specifying a particular subdocument is equivalent to asking for the whole work.) His research showed that for internal indexes, only 35-40% of possible index terms are used (the exhaustivity of the index), and that only 32% of the occurrences of each term are listed (the specificity of the index).

There is an apparent need to increase the usefulness of an index by increasing both these figures, although indexing every occurrence of every term would be counter-productive, providing too much material with a low signal-to-noise ratio. There is a need to balance representation, where the document is fully described by its index, with discrimination, where the unique features of the document are highlighted.

4.2.1.7 Preliminary Experiments

The goal of the experiments is to see if it is possible to identify a combination of the above techniques that will allow a useful set of index terms to be found automatically from a set of documents. Such a set of index terms is a set of significant document features, and can be used at least as a list of keywords to provide an extra navigation structure for a group of documents. Hopefully, by combining the instantiations of the terms, it would be possible to define links between relevant parts of documents.

The documents in use for these experiments are a month's CEEFAX news bulletins. These are composed of 585 individual news items, each about a paragraph in length. Appendix 2.4.1 lists the words appearing in the items in order of their frequency. All told there were 5535 unique words (no suffix stripping was done, so plurals are counted as separate words) appearing a total of 41,628 times. Appendix 2.4 contains a graph showing the frequency of occurrence of each word plotted against its rank in the frequency table. This graph conforms to the ideal plot mentioned in subsection 4.2.1.1, although it is a more extreme example of a hyperbola and it is not clear where the cut-off points should be drawn. (The displayed graph has been `zoomed in' on the interesting part. Zooming out to see the complete set yields a graph which hardly leaves the axes.)

Table 4.1 shows the probabilities expected for a function word as given by Bookstein, Swanson & Harter. The figure at co-ordinate (n,m) in the table is the probability that a function word will occur m times in a particular document, given that it occurs n times in all 585 news items.

The table shows that no statistical significance can be construed for words that appear fewer than several hundred times in the complete set of news items. Unfortunately, Appendix 2.4.1 shows that most of the words which occur so frequently are obviously function words (the, by, are, been, for...). It would appear that the individual documents are too small to be usefully treated by this method.

m\n     2975    2000    1000     500     200     100      50      10       5       1

 1     .0314   .1119   .3093   .3635   .2428   .1440   .0784   .0168   .0084   .0017
 2     .0799   .1914   .2644   .1553   .0415   .0123   .0033   .0001   .0000   .0000
 3     .1355   .2181   .1506   .0442   .0047   .0007   .0000   .0000   .0000   .0000
 4     .1723   .1864   .0643   .0094   .0004   .0000   .0000   .0000   .0000   .0000
 5     .1753   .1274   .0220   .0016   .0000   .0000   .0000   .0000   .0000   .0000
 6     .1486   .0726   .0062   .0002   .0000   .0000   .0000   .0000   .0000   .0000
 7     .1079   .0354   .0015   .0000   .0000   .0000   .0000   .0000   .0000   .0000
 8     .0686   .0151   .0003   .0000   .0000   .0000   .0000   .0000   .0000   .0000
 9     .0387   .0057   .0000   .0000   .0000   .0000   .0000   .0000   .0000   .0000
10     .0197   .0019   .0000   .0000   .0000   .0000   .0000   .0000   .0000   .0000

Table 4.1: Probabilities of Distribution of Function Words
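The entries in Table 4.1 can be reproduced from the Poisson model of subsection 4.2.1.2: a word occurring n times over the 585 news items has mean x = n/585 occurrences per item, and the probability of m occurrences in a particular item is then x^m e^-x / m!. A minimal sketch (small discrepancies in the last digit suggest the table values were truncated rather than rounded):

```python
import math

# Reproduce entries of Table 4.1: probability that a function word
# occurs m times in one news item, given n occurrences in all 585.
def poisson(x, m):
    """Poisson probability of m occurrences given mean x."""
    return (x ** m) * math.exp(-x) / math.factorial(m)

def table_entry(n, m, documents=585):
    return poisson(n / documents, m)

print("%.4f" % table_entry(2975, 1))  # compare .0314 in the table
print("%.4f" % table_entry(1000, 3))  # compare .1506 in the table
```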

The Curtice & Jones method does not define the criterion that two words must satisfy in order to co-occur. We have used the simple definition that two words co-occur if they both occur in the same sentence. From that definition, a scattergram can be obtained by plotting R_i against N_i. Their paper [45] indicates that the resultant points are scattered around a straight line of negative slope, noting that words corresponding to good index terms are those which appear below this line, i.e. words which appear with fewer others than expected. Applying this to our experiment, a line of equation y = -0.0015x + 9 provides a good fit. Taking the words with maximum distance from the line yields the following encouraging `Top-30' list (some of the vocabulary is explained by the fact that an ambulance strike was occurring at the time).
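The selection step can be sketched as follows: compute f_i and N_i with sentence-level co-occurrence, form the statistic R_i = N_i / f_i (an assumed form of the Curtice & Jones statistic), and rank words by their distance below the fitted line. The corpus and stop-word list here are invented for illustration:

```python
from collections import Counter

# Sketch of the co-occurrence selection described above. R_i = N_i/f_i
# is an assumed form of the Curtice & Jones statistic; words furthest
# below the fitted line y = -0.0015x + 9 are kept as candidate terms.
sentences = [
    ["the", "ambulance", "crews", "walked", "out"],
    ["the", "dispute", "over", "ambulance", "pay"],
    ["the", "ferry", "collision", "near", "the", "plant"],
]
STOP_WORDS = {"the", "over", "near", "out"}

freq = Counter(w for s in sentences for w in s)
neighbours = {w: set() for w in freq}
for s in sentences:
    for w in s:
        neighbours[w].update(x for x in s if x != w)

def distance_below_line(w):
    n = len(neighbours[w])
    r = n / freq[w]
    return (-0.0015 * n + 9) - r  # positive when the point lies below the line

candidates = [w for w in freq if w not in STOP_WORDS]
ranked = sorted(candidates, key=distance_below_line, reverse=True)
print(ranked[:3])
```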

ambulance contaminated emergency man ship
ambulances crews farms ms site
armed criminal feed NatWest unions
calls Deng ferry party walked
China department German Philips Walker
collision dispute London plant were

4.2.1.8 Conclusions

It appears that many of the methods given in academic research journals may give impressive results for a limited set of documents covering a certain domain of knowledge, but applying them to a different set of documents from a different field gives disappointing results. This may be because the methods have been biased toward the experimental data, leaving them unsuitable for general application. And so still today (some 20 years after the publication of these papers) indexing is either done manually, or by providing complete inverted indexes of entire online texts. The encouraging results above required large amounts of computation to achieve, and so this work was postponed. See [88] for more work on computing hypertext links using information retrieval techniques. In fact, more recently, work is being carried out to improve information retrieval by using the author's added hypertext links [44, 57], rather than vice versa. Because of these disappointing results, LACE was not augmented to allow automatic links based on document content.

4.3 Microcosm

A combination of the two opposing approaches above (rigid links or no links) is to allow `loose' link specification, where a link may be applied to a flexible domain of documents. This is seen in Microcosm [47, 55], where explicit sets of authored links may be applied even to documents subsequently added to the hypertext collection. This section briefly describes Microcosm, explores the differences between text-retrieval mechanisms, traditional hypertext links and Microcosm links and proposes a simple computational model for expressing Microcosm link semantics. Finally the use of Microcosm in conjunction with other hypertext paradigms is investigated. (Other systems with some similar features, notably the separation of document and link data, are briefly described in sections A1.4. and A1.14.)

Microcosm is a hypermedia system that has been developed and used at Southampton University over a period of five years. It has evolved considerably over that time, but retains the fundamental model of a group of co-operating processes communicating via message passing which together supply various hypermedia environment facilities. Its main features are:

* a selection-action paradigm for user interaction. Fixed link anchors (or buttons) are simply an author's predefined binding of a particular selection within a document to a particular hypertext action (such as follow link). In general, readers of a Microcosm hypertext can invoke a range of hypertext actions on arbitrary selections.

* links held externally to the documents they reference. This allows links to be made between the native documents of third-party applications, such as wordprocessors, spreadsheets, databases or CAD packages.

* a message passing framework, into which various document viewers or hypertext link servers (also known as filters) may be slotted. This framework is a circular chain in the current implementation, so a message will be received by the next application "downstream". This receiving application may process the message and block it, pass it on unchanged, or modify it in some way.

* a message format for coding information about user requests or hypertext facilities between the various components of the above framework.

* a document manager which associates document ids with file names and a set of other attributes (such as title, author, keywords, description).

In order to see how the components of Microcosm function together, consider how a link is followed. The user makes a selection in an open document in a viewer application, and then chooses the menu action "Follow Link". The application parcels the selection, its position within the document and the document's identifier into a message which is sent down the chain. A link database intercepts the message, looks up any links that correspond to that selection, and sends a message containing a specification of those links down the chain, along with the original link request message (possibly to be intercepted by further link databases). Eventually, all the link specification messages are intercepted by a dispatch filter, which presents the user with a dialog box containing descriptions of each of the applicable links. The user selects a link and the dispatch box sends a "Dispatch Link" message to the appropriate viewer. The viewer intercepts the message, opens the appropriate document and highlights the destination selection.
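The sequence of events just described can be sketched as a chain of filter functions (a hypothetical simplification for illustration; Microcosm's actual message format and filter interfaces differ):

```python
# Hypothetical sketch of the Microcosm filter chain described above:
# a follow-link request passes down the chain; a linkbase filter
# answers it, and a dispatch filter collects the answers for the
# user. Message fields and link data are invented for illustration.
linkbase = {("hypertext", "doc1"): [("doc2", "Definition of hypertext")]}

def linkbase_filter(messages):
    """Pass messages on, adding link-found messages for known selections."""
    out = []
    for msg in messages:
        out.append(msg)
        if msg["action"] == "follow-link":
            key = (msg["selection"], msg["document"])
            for dest, desc in linkbase.get(key, []):
                out.append({"action": "link-found",
                            "destination": dest, "description": desc})
    return out

def dispatch_filter(messages):
    """Collect link-found messages for presentation to the user."""
    return [m for m in messages if m["action"] == "link-found"]

request = [{"action": "follow-link", "selection": "hypertext",
            "document": "doc1"}]
print(dispatch_filter(linkbase_filter(request)))
```

Because the original request is passed on unchanged, further linkbase filters downstream could add their own link-found messages before dispatch.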

The user interface to link authoring is very similar to that of Intermedia: a source is selected and the user chooses "Start Link", and then a destination is selected and the user chooses "Complete Link". How this differs from Intermedia is that the link is expressed not as a simple source point to destination point relationship, but as a mapping from a source selection in a particular context to a destination selection. When the user creates the link there are options to specify the context of the source selection: either this exact place in the source document, or any place in that document, or any place in any document. This choice allows the user to create a specific, local or generic link.

A generic link, the most common link type in Microcosm hypertexts, allows the author to associate a document with any occurrence of a particular textual string in any document. At first sight this may seem to be just a text retrieval operation, however there are certain key differences. Firstly, from a practical point of view, a generic link requires no indexing of the possible destination documents, nor a searching operation on every document in the hypertext in order to satisfy the link--a generic link has none of the overheads associated with text searching. Secondly, the difference between generic links and text retrieval is the difference between intentional and non-intentional hypertexts: a link expresses an author's knowledge of a relationship between the meaning of two entities in the hypertext, whereas a text retrieval operation expresses a statistical similarity in textual features of two hypertext entities. It is possible to liken a generic link to a text retrieval operation in reverse: a generic link works for a single destination and specifies a collection of applicable sources, whereas a text retrieval operation works from a specific source and describes a collection of applicable destinations. (See section 4.3.2 for more explanation about the declarative nature of generic links.)
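The three levels of contextual constraint (specific, local and generic) can be sketched as a resolution order over linkbase keys (a hypothetical model for illustration; the real Microcosm linkbase format differs):

```python
# Hypothetical sketch of Microcosm-style link resolution: a selection
# is matched first against specific links (exact document and offset),
# then local links (same document, any offset), then generic links
# (any document). Link data is invented for illustration.
links = {
    ("lace", "doc1", 42): "Chapter 3 on LACE",        # specific
    ("lace", "doc1", None): "LACE overview",          # local
    ("lace", None, None): "Glossary entry for LACE",  # generic
}

def resolve(selection, document, offset):
    """Return all links applicable to this selection, most specific first."""
    keys = [(selection, document, offset),
            (selection, document, None),
            (selection, None, None)]
    return [links[k] for k in keys if k in links]

print(resolve("lace", "doc2", 7))   # only the generic link applies
```

Note that no searching of document contents is involved: the generic link binds to the selection itself, whatever document it occurs in.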

The significance of generic links is particularly apparent to hypertext authors. When authoring links in many other systems, the question that a hypertext author is constantly posing is "where can I go to from here?". The answer to this question is usually `destination anchor a of destination node A'. For every piece of information (node) that is added to the system the author must add links that leave it, tying it into the hypertext corpus. In Microcosm the author's perspective is changed, and the question becomes "from whence should I be able to come here?" or "what characteristics must another node have in order to link here?". This may simply be a reposing of the original question, with the answer being `source anchor a in source node A', but is usually far more general such that the answer is in terms of multiple sources (frequently constrained by a particular context). The significance of this becomes apparent when one considers resource-based hypertexts, where a large number of unchanging nodes is provided as an encyclopædia or anthology. By describing these `generic links' for the resource one may add any new node to the hypertext (a teacher's overview of a topic or a student's essay) and it will be already tied into the corpus by the existing link framework.

In this respect Microcosm provides an excellent framework for developing resource-based learning materials, as demonstrated in the HiDES history courses [41], or the SToMP TLTP initiative [6]. As a particular example of the advantages of this authoring paradigm, consider setting up a multiple-choice test based on material in a standard course text. In a normal environment containing only specific links between nodes, for each possible wrong link (i.e. wrong answer) a separate correcting explanation must be written for the user, recalling the material in the original sources. Using Microcosm, the question, the text of each answer and any explanations written will all automatically be linked back to the concepts in the original sources by virtue of generic links from those sources which apply to the text used in the quiz.

The previous IBM PC version of Microcosm (version 2.2) provided a limited set of generalised link facilities (specific links from a point in a particular document to a point in another particular document, local links from a selection in a particular document to a point in another particular document and generic links from a selection in any document to a point in a particular document). The latest version of Microcosm (version 3.0) provides a hierarchy of logical types which can be used to classify each document in the hypertext. This new facility allows the `genericness' of links to be expanded: instead of just three levels of contextual constraint a link could also be constrained by the class or superclass of its source document. For example, in a Computer Science hypertext application, a link between the word `PASCAL' and a glossary entry giving a short description of the programming language may be valid only in documents of type `Introduction', rather than in technically more detailed documents of type `Language Syntax'.

The flexibility of Microcosm link sources provides its `reversed' hypertext authoring paradigm (how may other nodes be linked to the current node?). It is possible to argue that generic links have similarities with StrathTutor's linking mechanism; in setting up a generic link between a source selection and a destination node, the author is classifying the destination node or labelling it with an attribute corresponding to the selection. Following a link is then a matter of selecting the text corresponding to an attribute. This is seen in practice, since an author frequently sets up a set of generic links on a resource document by selecting the phrases which are seen as describing the contents and setting up generic links from those points to themselves. Effectively the author, using the generic link mechanism, is labelling the document with textual key words or key phrases. Thus the authoring paradigm has become declarative in nature, describing the data rather than the processes involved in document links. Microcosm is similar in this use of generic links to systems like WAIS and StrathTutor. In all these cases there are no explicit connections between source and destination documents. Instead, the destination documents, either labelled or indexed, are a set of external information resources that bind, on the fly, to specific occurrences of source document selections. Microcosm is different from these systems because these bindings are the expression of authored links more usually seen in standard hypertext model systems like Intermedia or HyperCard.

Hypertext packages are frequently difficult to author in a scalable or generic fashion which allows for expansion or economic re-use for different purposes. The links, authored for a particular purpose, are fixed inside the document content and fixed to specific destinations. Expanding a Microcosm hypertext by adding new nodes involves one of two scenarios. If the nodes are new general resources (primary materials) then a group of new generic links must be added which will retrospectively apply to the existing hypertext components. If instead they are new secondary materials (students' essays or teachers' commentaries on the primary materials) then they will already be affected by the existing links. In this respect the Microcosm hypertext model is incrementally scalable.

Changing the purpose of the hypertext may involve keeping the collection of nodes substantially the same, but reworking links to provide different structures of access. In many hypertext packages changing the links means rewriting the texts because the links are embedded in the texts--in Microcosm it simply means applying a new set of linkbases to the same material, in a similar way to Intermedia's use of webs. The second advantage of Microcosm is that material which is added during the `repurposing' process will be automatically affected by any retained linkbases. Since many hypertext packages provide embedded point-to-point linking (i.e. from here you can go there) they fail to offer such expandability or maintainability.

Microcosm still retains many disadvantages when it comes to maintaining a hypertext. Since the current implementation has most links expressed in terms of fixed destinations (a generic link has a flexible source), changes made to the destination may invalidate a link. The Document Manager provides a level of indirection between document identifier and physical file name, so renaming or re-organising a collection of hypermedia resources is not harmful, but removing a component file may leave a `dangling link'. Editing the contents of a destination file is likely to lose synchronisation between the offsets held in the link databases and the document contents and so `shift' the link anchor away from the correct link endpoint. This is not a problem for link sources, unless a specific link has been used.

This `editing problem' is not unique to Microcosm: see [50] for the way in which it is apparent in HyTime, and [48] for more details about how a hypermedia environment can be constructed to minimise the effects of this problem.

4.3.1 Microcosm Structure

In an environment in which the hypertext structure may be expressed by loose links, it is not obvious how hypertext authors will choose to compose their hypertext networks. Will there, for example, be specific links for static (process-based) navigation paths and generic links for content-based browsing? Here we examine three educational hypertexts to see the various techniques used.

4.3.1.1 Solar Cells

The first hypertext is on the subject of `Solar Cells', and was authored by Dr Gerard Hutchings of the Department of Electronics and Computer Science at Southampton University using Microcosm version 2.2, based on material pre-authored for the UNESCO Energy Training Programme. It is aimed at postgraduate level students.

This hypertext has been translated from an existing book, and has an initial structure which is based on that book. The material is split into disjoint nodes which follow the book's division into sections and subsections, although the nodes only implicitly model the hierarchical structure of the original. The hierarchy is made explicit only on the `Contents' node which contains a table of contents with buttons linking the section and subsection titles to the appropriate nodes. Generic links have been created according to the index entries which the original author has indicated.

As well as the organisational structuring and the content-based linking, summary documents (containing links to the nodes they summarise) and overview documents (outlining the educational aims and objectives of the hypertext) have been provided. These have been added as pseudo document types (actually aliases for document type TEXT), allowing the user to distinguish them from normal contentful text documents in Microcosm's standard `Open Document' dialogue box. The pseudotypes `Photograph' and `Diagram' have also been added as aliases for documents of type `BITMAP'. We can see the various structures inherent in and imposed on the hypertext:

* an organisational hierarchy accessed via the specific links on the `Contents' node

* a cross-reference web of subject-based generic links, accessible from any use of a technical term in the material

* a catalogue of available nodes listed by document type

The catalogue of nodes is an inherent part of the Microcosm system (at least of this implementation). Every document that is known to the Document Manager is available from the `Open Document' dialogue box, listed according to its type and with its (user assigned) description. Since this dialogue box is one of the most obvious features of the Microcosm user interface, and since a hypertext network has no explicit starting node, users often make quite heavy use of this method of navigation in preference to hypertext links.

4.3.1.2 Cell Motility

The second hypertext is `Cell Motility', a biology application written by Dr Gerard Hutchings and used in the teaching of undergraduate biology at Southampton University. It is a conversion of an existing hypertext, authored in StackMaker, a HyperCard-based authoring system [72] by the same author. This hypertext exists in two versions: version 2.0 is a simple translation of the original HyperCard stack whereas version 2.2 takes advantage of additional Microcosm features.

The original stack consists of a number of cards accessed through an Introductory card which contains a textual overview of the topics available and a set of buttons which lead to these topics. Contents and Index cards are also available to aid navigation. Version 2.0 was a faithful copy of this stack, with one document per card, and with no generic links, only buttons (i.e. highlighted specific links) navigating across the hypertext. The buttons either navigate from one topic to another (i.e. organisational structure) or provide subject-based cross-references. The only additional feature is the Microcosm catalogue of nodes, which in this case simply mirrors the Contents node.

Analysis of logs taken of students' use of the Microcosm and HyperCard versions shows that 43% of the HyperCard users' interactions with the system were link following actions, rather than browsing through the table of contents or alternative navigation mechanisms. In contrast, only 26% of Microcosm users' actions were link following [73]. One of the reasons given for this is that the Microcosm catalogue is always conveniently available, whereas the Contents node must be explicitly requested. It is interesting and salutary to note that, as far as this group of inexperienced hypertext users was concerned, the most natural way to use a Microcosm hypertext is not through the linking facilities.

The second Microcosm Cell Motility application (version 2.2) is much more interesting in terms of its use of Microcosm features. It is structured as follows:

* the original nodes are retained.

* the original buttons (specific links) are retained. These are now known as the Tutorial Links.

* a set of Reference Links have been added. These are generic links to 30 of the most significant original nodes from appropriate key phrases.

* a Biology Dictionary has been added. This is a set of documents containing definitions for 3,000 biological terms along with the generic links which link into those dictionary documents.

* a Quiz has been added. This is a single Toolbook document which implements a set of multiple choice questions. It consists of a number of questions, each with a set of possible answers; each possible answer has an associated short (up to 256 character) explanation of why it is or is not correct.

Each of the above components is completely modular and has its own separate linkbase (the same Biology Dictionary is used in a number of different applications). The quiz deserves further elaboration: it has no links of its own, but it is subject to the links from the other resources. When the user chooses one of the multiple choice answers, a dialogue box appears with the associated explanation and a button labelled `Further Explanation' which causes Microcosm to look for all the links which apply to any parts of the text in the explanation.

4.3.1.3 Waves

The third hypertext is `Waves', a physics application collaboratively written by a consortium of physicists from the SToMP TLTP group [6], with particular input from Dr. Steve Rake at the Multimedia Lab at Southampton University.

The application is structured into a set of Reference Works (a physics textbook, a databook, a glossary, a set of biographies and a bibliography) and a set of Teaching Documents (Tutorial Documents, also known as Scripts, and Activity Documents, usually ToolBook simulations or question and answer sessions). The reference works are mainly linked to by generic links from appropriate text selections in the tutorial works, except for the bibliography which is linked to by specific links from the tutorials. The textbook also contains a large number of specific links to its sections from the various tutorials, but the databook is mainly intended for browsing.

The tutorial works are the main teaching mechanism and are closely bound to their related activities by specific links. Some generic links are used to lead users into particular teaching sessions on particular subjects, but the tutorial sessions have, as the project has developed, become largely sequential. Each teaching document is a long textual document, intended to be followed from beginning to end, and each document contains specific links to the next and previous teaching documents. (These links are over the words "Next" and "Previous" and so the order of teaching delivery has not been tied to the document content and can be customised by applying a different linkbase.) A decision was made to use no cross-reference links between teaching documents, to protect students from the navigational confusion of jumping from topic to topic.

Much use is made of the logical type facility of Microcosm 3.0 for providing a classification hierarchy as a navigation mechanism for the students. Each major teaching theme also has a graphical map which indicates the units which should be followed, the order in which they come and the way they are grouped into sub-themes.

The authors of the project material have found the structure of the hypertext has varied as the facilities of the viewing application have evolved. Originally each teaching document had many associated annotation documents, but after the viewer implemented popup windows for marginalia these documents were incorporated into the appropriate `master' document. This has pruned the classification hierarchy considerably, and substantially reduced the total number of documents.

This hypertext bears many similarities to the previous two examples: it contains a suite of reference information which is accessed by a network of generic links. It is also subject to a classification imposed by the document manager. Where it differs is in the information-rich tutorial documents. Although all the subject-based information can be found in the reference works, the tutorials are more than a simple set of directed walks through the reference resources. For this reason they have generic links leading into them, but no elaborate cross-referencing between them which would hinder the process objectives of making students understand a particular topic.

4.3.1.4 Conclusions

One of the common features of these three hypertexts is the emphasis on a modular construction of linkbase and document sets, knitted together by Microcosm's generic links. This allows hypertexts to grow by adding new documents without reauthoring document contents or existing links. It also enables new facilities to be slotted into the hypertext without disruption.

It is apparent from these informal studies that the document manager plays a major role in providing static structure and dynamic navigation around a Microcosm hypertext, especially with the latest facilities for assigning logical types to documents. The document manager provides a simple classification system somewhat like the use of generic links described above, but separate from the hypertext link facilities and the document contents. This fixed classification is appealing because it gives the user the illusion of an absolute frame of reference with which to orient themselves. However, the real power of Microcosm lies in its generic links, which in turn support the concept of generic and reusable authoring.

4.3.2 Microcosm Model

As discussed above, the current Microcosm implementations have focussed on a restricted subset of the model's potential for hypertext linking. Currently only textual comparisons are used to define generic links, and the flexibility of the links (or genericness) is available only on three levels. In this section we try to describe the Microcosm model instead of a specific Microcosm implementation, and use that model to demonstrate new kinds of generic link which improve the robustness of the system in the face of a non-static set of hypertext resources.

The Dexter Hypertext Reference Model [62] is a general model for hypertext, and is often used to compare specific hypertext systems against. The Dexter model divides a hypertext system into three layers: the storage layer which is used to hold the node, link and anchor components of the network, the runtime layer which deals with user interaction mechanisms and the within-component layer which addresses the content and internal structures of individual nodes. A link is a combination of specifiers; each specifier consists of a component specification (which resolves to a single node in the hypertext) and an anchor composed of an identifier (unique within each node) and an (undefined) mechanism for pointing into the node's (opaque) contents.

In some measure Microcosm is isomorphic to the Dexter model: the storage of nodes and links is separated and held by the host file system. The presentation of nodes and links and the user's interaction is controlled by the viewing applications and can be changed according to the user's requirements. The within-component layer is equally vague in Dexter and Microcosm, where the interpretation of the content of nodes is left to the viewing applications. Both Microcosm and Dexter require an opaque handle into the node contents in order to specify the position of a link anchor; the handle may be a numerical offset from the start of the file, a hierarchical tree position, a two-dimensional co-ordinate in an image, or any required measurement. Where Dexter and Microcosm part company is in the relationship between links and nodes: each end of a Dexter link resolves (perhaps by a rule) to a single node, but a Microcosm link may resolve to arbitrarily many nodes. Because of this disparity it is difficult to describe Microcosm in terms of the Dexter model.

We have seen that Microcosm provides a declarative authoring model but that link following is usually explained procedurally, in terms of the flow of messages and the actions of each individual Microcosm process. Especially because of the dynamic nature of the Microcosm environment, where viewers and filters are added or removed at will, it has been difficult to describe `link following' formally. This section shows that both link creation and link following can be described declaratively, and gives a simple declarative model (expressed in Prolog) which demonstrates all the features of Microcosm.

The declarative model has repercussions for the end-users of the system--encouraging authors to think according to a declarative paradigm makes their task easier and allows greater extensibility. A similar comparison can be made between two commercial font rendering schemes: Adobe's Type 1 and Microsoft's TrueType. The former consists of a character outline and `hints' about the important or potentially problematic regions of each character shape. The latter consists of a program with instructions to render the shape of each character. Each TrueType character has to contain explicit instructions to draw itself at every device resolution, and the expertise to maintain a good-looking character shape on low-resolution devices such as computer displays. The advantage of Type 1 is that it is easier to code individual character shapes: all the intelligence is in the (external) font rendering engine, and as font rendering technology improves the same set of outlines and hints can be drawn with better quality. The secret of Type 1's success is the ability to come up with a general set of hints that apply to all kinds of character shapes under all kinds of conditions. By analogy, a declarative approach to hypertext links that provides a general set of mechanisms for expressing relationships between nodes, may allow an improved link engine to provide a `better' set of destination sites from a given source.

4.3.2.1 Link Representation, Creation and Following

Microcosm links are stored in a link database (linkbase), where each link documents a relationship between two `anchors' (i.e. two points in a "document space"), and can be expressed in Prolog as follows:

relates([documentA, offset1, "selection"], [documentB, offsetB, "other"]).
relates([documentA, offset2, "foo"], [documentC, offsetZ, "bar"]).

Creating a link requires the user to make selections "Point A" and "Point B". The system then stores the attributes of these selections in a linkbase. The following Prolog fragment assumes for simplicity that there is only one linkbase, and that it is stored in memory, not written to a file in permanent storage.

createlink([SrcDoc,SrcOffset,SrcSelection],[DestDoc,DestOffset,DestSelection]):-
    assert(relates([SrcDoc,SrcOffset,SrcSelection],
                   [DestDoc,DestOffset,DestSelection])).

Following a specific link from "Point A" is equivalent to gathering all the relevant attributes of "Point A" (i.e. [documentA, offsetA, selectionA]) and evaluating the following Prolog fragment to find point B:

relates([documentA, offsetA, selectionA], [WhichDoc, WhichOffset, WhichSel]).

Following a local link from "Point A" is similar to the above, except that the following Prolog fragment is elaborated instead:

relates([documentA, _, selectionA], [WhichDoc, WhichOffset, WhichSel]).

Following a generic link from "Point A" is also similar to the above, except that the following Prolog fragment is elaborated instead:

relates([_, _, selectionA], [WhichDoc, WhichOffset, WhichSel]).

(In this case the three different kinds of link are obtained by the application of three different rules to the linkbase. This is also seen in the current Microcosm implementations: document, offset and selection data are stored for all three kinds of links, but an extra field is used to distinguish between the different link types. We shall pick up on this distinguishing information at a later stage.)

Having located the destination of the link, this must now be displayed to the user by the document dispatcher. Most of the facilities that the dispatcher makes use of are beyond the scope of this model (running a program, opening a document and making a selection), but are shown in the following Prolog fragment.

dispatchlink([DestDoc,DestOffset,DestSel]):-
typeofdocument(DestDoc,Type), applicationof(Type,App), run(App,DestDoc),
makeselection(App,DestDoc,DestOffset,DestSel).

Assembling all these fragments yields the following simple schema for link following.

followlink(Source,Destination):-
    findlink(Source,Destination), dispatchlink(Destination).
findlink(Source,Destination):-
    specific(Source,Destination);
    local(Source,Destination);
    generic(Source,Destination).
dispatchlink([DestDoc,DestOff,DestSel]):-
    typeofdocument(DestDoc,Type), applicationof(Type,App),
    run(App,DestDoc), makeselection(App,DestDoc,DestOff,DestSel).
specific([SrcDoc,SrcOff,SrcSel],[DestDoc,DestOff,DestSel]):-
    relates([SrcDoc,SrcOff,SrcSel],[DestDoc,DestOff,DestSel]).
local([SrcDoc,SrcOff,SrcSel],[DestDoc,DestOff,DestSel]):-
    relates([SrcDoc,_,SrcSel],[DestDoc,DestOff,DestSel]).
generic([SrcDoc,SrcOff,SrcSel],[DestDoc,DestOff,DestSel]):-
    relates([_,_,SrcSel],[DestDoc,DestOff,DestSel]).

Purely declarative frameworks have no side effects, but Microcosm makes judicious use of side effects to keep track of a user's session history. Adding the following definition (a pseudo-dispatcher which just saves its arguments to a file) to the above schema allows us to keep such a history.

dispatchlink(Destination):- tell('History'), print(Destination), nl, told, fail.

The three link following actions shown above are really just `plug and play' semantics--they are the most widely used semantics for link following, but are by no means the only ones. Another standard Microcosm link facility is the so-called "Computed Link" which does text-retrieval operations based upon the selected text. Its definition would be similar in form to that of the generic link, i.e. ignoring document and offset information. (A more sophisticated version of this facility may make use of the selection's semantic context and so take into account all these details.) Any mix of these (and other) features may be chosen at the start of the session, so that it is impossible to tell in advance how link following will be accomplished. In fact any feature may be added or removed at any time during the session, so it is impossible to guarantee that a particular link (relationship between two document points) will be available under all circumstances. In short, Microcosm semantics are dynamically added to the system, rather than being a static feature of it. To accommodate this flexibility within the model, let us allow the link following predicates (or link resolvers) to be defined dynamically, for example, reading their names from an initialisation file, and asserting this list of names as a fact in the Prolog knowledge base.

startup:- see('Resolvers'), read(Resolvers), seen, assert(resolvers(Resolvers)).

Now let us rewrite the findlink predicate to take a particular resolution function name as a parameter, rather than explicitly including each function as an alternative.

findlink(Resolve, Source, Dest):- call(Resolve, Source, Dest).

And consequently, let us make followlink use all the resolvers in turn, storing all the link destinations in a list. A variant of dispatchlink must now be used to handle multiple link destinations.

followlink(Source, DestList):- resolvers(ResList),
    findall(Dest, (member(Res, ResList), findlink(Res, Source, Dest)),
            DestList),
    dispatchlink(DestList).
dispatchlink(ListofDests):-
    ListofDests = [[_,_,_] | _],   % check we have a list of dests here...
    askuser('Which links?', ListofDests, ActualDests),
    maplist(dispatchlink, ActualDests).

Now we have given three alternative link dispatchers: one which acts on a single destination, one which acts on multiple destinations, and the history pseudo-dispatcher. We could make the definition of followlink more symmetrical by allowing a dynamic list of dispatchers, mirroring the definitions for the list of resolvers.
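Such a symmetrical scheme might be sketched as follows. This is only an illustration within the model: the dispatchers/1 predicate, the dispatchall/1 predicate and the `Dispatchers' initialisation file are hypothetical names, mirroring the treatment of resolvers above.

startup:-
    see('Resolvers'), read(Resolvers), seen,
    assert(resolvers(Resolvers)),
    % hypothetical: a list of dispatcher names read at startup
    see('Dispatchers'), read(Dispatchers), seen,
    assert(dispatchers(Dispatchers)).

% Apply every registered dispatcher to the list of destinations in turn,
% using a failure-driven loop (as the history pseudo-dispatcher did).
dispatchall(DestList):-
    dispatchers(DispList),
    member(Disp, DispList),
    call(Disp, DestList),
    fail.
dispatchall(_).

Under this arrangement adding a new dispatcher, like adding a new resolver, requires no change to the link following schema itself.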

If we wish to increase the range of available link types, how should extra information be held in the link base, and how should a new resolver be coded? For example, if a new kind of link were implemented which only applied to source documents of a given type, then either all the work could be accomplished in the resolver

srctypelink([SrcDoc,SrcOff,SrcSel],[DestDoc,DestOff,DestSel]):-
relates([SrcDoc,SrcOff,SrcSel], [DestDoc,DestOff,DestSel]),
typeof(SrcDoc, 'Introduction').

or all the necessary information could be coded in the link itself

relates([Doc, _, 'PASCAL'], [gloss, 1000, 'PASCAL']):-
    typeof(Doc, 'Introduction').

Examining the user interface for link creation in the PC Microcosm implementations shows that in creating a generic or local link, the user actually defines a specific link along with a rule for generalising it. An alternative view is that the user gives an example of the generic link, from which the other links can be deduced. Either way, the user expresses a specific relationship which is an instantiation of a general rule, and this is reflected in the entry in the linkbase, which codes all of the information about the link (even the source offset and source node, which are redundant for a generic link) along with an indication of the link type. Hence we could expand the linkbase entries as follows:

relates(generic, [doc21, 128, 'Internet'], [glossdoc, 1500, 'Networks']).
relates(specific, [doc34, 100, 'TeX'], [glossdoc, 2000, 'Typesetting']).
relates(srctype, [overviewdoc, 345, 'PASCAL'], [glossdoc, 90, 'PASCAL']).

The last of these examples may be a way to code the `generic link constrained by source document type' proposed earlier. The addition of a link type to a link is analogous to the addition of a hint to a Type 1 font: a link can always be deduced from the data in the linkbase entry (i.e. the specific example provided by the user), but if a suitably intelligent resolver is present it can use the link type to generalise a new set of links from this example. Any set of new link types will probably need to access more information about the link context than simply the document id, the selection and its offset within the document. Other information, such as the document type, or the document's description, keywords or even contents can be obtained from the document manager by the link resolver, but here we choose to explicitly store the information in the link base in order to make the declarative model more transparent.

relates(srctype, [overviewdoc, 345, 'PASCAL', introtype, "Languages Overview "],
[glossdoc, 1000, 'PASCAL', referencetype, "Languages Glossary"]).

The accompanying resolver for srctype links could be defined as follows:

srctypelink([SrcDoc, SrcOffset, SrcSel, SrcType, SrcDesc], Dest):-
relates(srctype, [_, _, SrcSel, SrcType, _ ], Dest).

The model now gives a mechanism for describing flexible generalisations on link sources, but can it also provide generalisations on link destinations (such as the `Computed Links' text retrieval mechanism in PC Microcosm), given that each resolver produces only a single destination at a time? In fact this is taken care of by followlink's use of findall, which not only tries each resolver in turn, but also retries each resolver until it produces no more destinations. Hence computed links may be expressed as follows in the linkbase:

relates(computed, [overviewdoc, 345, 'PASCAL', introtype, "Languages Overview "],
[glossdoc, 1000, 'PASCAL', referencetype, "Languages Glossary"]).

and could be implemented by the following resolver (where the definition of the hypothetical grep is not included here):

computedlink([SrcDoc, SrcOffset, SrcSel, SrcType, SrcDesc],
[DestDoc, DestOffset, DestSel, DestType, DestDesc]):-
relates(computed, [_, _, SrcSel, _, _ ], [_, _, _, _, _]),
grep(SrcSel, globalindex, [DestDoc, DestOffset, DestSel, DestType, DestDesc]).

In this model every link must be labelled with a type name which can be recognised by a resolver. In the same way that an improved font rendering algorithm may produce better character shapes from the same Type 1 description, improvements to the resolvers may produce more, or more relevant, sets of destinations from the same links. For example, the generic links resolver could be enhanced to take into account variations in spelling or the use of homonyms in the selected text. Similarly, the above srctypelink resolver may select documents not only with the given type, but also whose description or keyword attributes include the type name.
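As an illustration, such an enhanced generic resolver might be sketched as follows. The predicate variants/2, which enumerates alternative spellings and homonyms of a selection, is hypothetical, as is the name genericlink; the linkbase entries are the typed relates/3 facts introduced above.

% Try the selection itself and then each of its variants against the
% linkbase; on backtracking every matching destination is produced.
genericlink([SrcDoc, SrcOff, SrcSel], [DestDoc, DestOff, DestSel]):-
    variants(SrcSel, Variant),
    relates(generic, [_, _, Variant], [DestDoc, DestOff, DestSel]).

A single authored link may thus serve several surface forms of the same term, without any change to the linkbase itself.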

By working in a declarative environment, it is possible to expand the link type in the linkbase entries from a pure label into an expression which is itself a representation of the relationship. For example, the following linkbase entry represents the (rather pointless) link between any word beginning with the letter `a' and a particular destination:

relates( string2list(SrcSel,[a|_]),
[mysrcdoc, 1234, "apple", 'text', "pointless"],
[dictdoc, 514, "Alpha", 'text', "dictionary of letters"]).

where SrcDoc, SrcOffset, SrcSel, SrcType and SrcDesc along with the corresponding Dest- forms are Prolog variables which will be instantiated before the `type' is evaluated.
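A resolver able to handle such expression types might, as a sketch, simply call the stored expression as a Prolog goal once the anchor fields have been unified (the name exprlink is hypothetical):

exprlink([SrcDoc, SrcOff, SrcSel, SrcType, SrcDesc],
         [DestDoc, DestOff, DestSel, DestType, DestDesc]):-
    relates(Condition,
            [SrcDoc, SrcOff, SrcSel, SrcType, SrcDesc],
            [DestDoc, DestOff, DestSel, DestType, DestDesc]),
    % unifying the anchors has instantiated any variables shared
    % with the condition, which is now evaluated as a goal
    call(Condition).

The link succeeds only for those links whose expression holds of the current selection, so the `type' field has become an arbitrary guard on link applicability.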

4.3.2.2 Conclusion

We have shown that Microcosm can be described in a declarative style in which link resolution is parameterised not only by the link data but by the resolution semantics. We have also shown how encouraging a declarative attitude to link authoring is helpful to the authors and maintainers of a hypertext.

4.3.3 Microcosm Meets Structured Documentation

Microcosm has typically been used with non-structured textual data: each of the lexias is a simple text document that does not consist of an (explicit) internal network of subcomponents. This section describes the author's attempt to use Microcosm as a hypertext environment for displaying structured documents expressed in SGML. By using an SGML parser to convert structured documents into a form which can be interpreted by a Microcosm-aware application it is possible to provide hypertext navigation facilities for SGML documents.

This experiment was undertaken, by the author, as a demonstration in conjunction with Oxford University's Elektra project (a study of 17th and 18th century women's literature). The texts were keyed in using a specific SGML document type based on the Text Encoding Initiative DTD [30], which was subsequently parsed by sgmls, an SGML parser. The intermediate output from the parser was processed by a UNIX awk script into an RTF file (for display by Word for Windows) and a linkbase (for use by Microcosm). The structure of each of the texts was similar: a title page, a table of contents and a sequence of chapters. Each chapter was divided into pages, paragraphs and lines (all strictly recorded in the markup so as to give a visual authenticity to the electronic reproduction). The translation process added whatever hypertext navigation and automatic link creation was possible; however, since the explicit structure of each text was simpler than that of the technical reports used in LACE (the chapters are not subdivided into sections, and there are no explicit cross references or citations), this was mainly limited to linking each element of the table of contents to the matching chapter. The translation process also added links from the title page, table of contents and first page of text to graphics files which contain images of the corresponding page in the original printed document. A link was also made from any page in the document to biographical information about the author. These links were specified not by the document's structure, but were agreed as standards for the project.


Figure 4.5a: An Elektra document under Microcosm

This translation process was specific not only to the DTD but also to this particular application. The TEI DTD is very general, and so formatting and linking decisions were made so as to yield a specific style which was close to one particular original document. Some of the document markup allowed specification of physical representation (e.g. the emphasis element had a representation attribute which could be used to code a specific font or style name to use to provide the emphasis) but the majority was chosen in the translation. Alternative measures, such as the use of SGML's LINK specification to refer to style sheets, may have been more appropriate for a more thorough experiment.

<!doctype ota>
<ota>
<text>
...
<pb n="19">but he taught them to be cruel while
<lb>he tormented them: the consequence
<lb>was, that they neglected him when he
<lb>was old and feeble; and he died in a
<lb>ditch.
<p>You may now go and feed your
<lb>birds, and tie some of the straggling
<lb>flowers round the garden sticks. After
<lb>dinner, if the weather continues fine,
<lb>we will walk to the wood, and I will
<lb>shew you the hole in the lime-stone
<lb>mountain (a mountain whose bowels,
<lb>as we call them, are lime-stones) in
<lb>which poor crazy Robin and his dog
<lb>lived.
</div>
<div type='chapter' n='3' id=CH3>
<head>CHAP. III.
<lb><hi rend='small italic'>The treatment of
animals&mdash;The story of
<lb>crazy Robin&mdash;The man confined in
<lb>the Bastille.</hi>
</head>
<p>In the afternoon the children bounded
<lb>over the short grass of the common,
</div>
</body>
</text>
</ota>

Figure 4.5b: SGML markup for Elektra document

Although this translation successfully allows Microcosm access to SGML documents, it is less sophisticated than the similar LACE process. There is no resultant collection of individual subnodes, nor can any part of the document be referenced externally. Instead, source and destination anchors are created inside the (linear) text as gotobuttons and bookmarks, as described in section A1.11. Links that occur as a result of the translation of the document's internal structure are therefore not handled by the Microcosm link engine at all, but by the word-processor itself. The user is free to make further links, which are processed by Microcosm. See figures 4.5a and 4.5b for the display of an Elektra document under Microcosm together with the corresponding markup.

4.3.4 Microcosm Meets WWW

The World-Wide Web is characterised by three components: (i) a single, well-defined native data format used with the Web document viewer, (ii) a universal addressing scheme with associated transfer protocol and (iii) a hypertext authoring scheme in which precise destination addresses of links are specified as part of the source documents. The Web's standard data format is based on SGML, and so is amenable to the same treatment as the SGML data described in the previous section. In comparison with WWW, Microcosm is characterised by (i) a co-operative framework for diverse document viewers and (ii) a hypertext authoring strategy which is based on generic relationships between source and destination documents.

Link fossilisation is a significant disadvantage of WWW and occurs because link specifications have to be published as part of the document and cannot be changed without revising the document. Since links refer to their destination anchors via a specific machine name and path name, any change to the position of the destination requires every source document which refers to it to be changed--once published, a document can never be moved or deleted. Although this is not an insurmountable problem in a locally controlled context, WWW used as a world-wide publishing mechanism assumes that every document is forever associated with its published address.

Dead ends frequently occur in WWW because only native WWW documents can have embedded links. If traversing a link leads to a foreign document being displayed by a foreign application (e.g. an RTF file displayed by Word), then no WWW links may be followed from it.

Microcosm does not suffer from these problems. Dead ends do not occur because almost any program can be used as a Microcosm viewer for many different kinds of data: links can be followed not only between text and graphic files, but between wordprocessed documents (Microsoft Word), design documents (AutoCAD), spreadsheets (Excel), databases (SuperBase), video documents (AVI) and simulations (SuperCard). Links do not get fossilised because they are not embedded in the documents to which they refer and because they represent rules for linking sets of documents together, rather than specific hardwired document references. What Microcosm does lack is the ability to access documents distributed across machines, but that facility of the WWW can easily be `plugged into' Microcosm by allowing a URL to be used in place of a Microcosm document id, as outlined below.

* Allow WWW files to be accessed by constructing a Microcosm filter which intercepts "Dispatch Link" messages from the Link Dispatch dialogue box. If the document is local to the machine, send the message on unchanged. If, on the other hand, the document to be opened is specified by a URL, check to see if it has already been downloaded into a local cache directory. If not, request the document from the appropriate network server and write it to a local file. If appropriate use an SGML parser to translate the HTML into RTF for viewing in Word for Windows. Once the (possibly translated) file exists on the local disk, send a new Microcosm message asking for it to be dispatched instead of the remote URL.

* Build a filter that allows links to be made to or from a WWW document. This filter should come in front of the MakeLink filter, intercepting any messages which indicate links to be made to files in the cache directory. It should translate the local file name back into the original URL and emit a corrected Make.Link message for the real link maker to pick up and add to the appropriate linkbase.

* Build a filter that allows links to be followed from a WWW document. This is a simple filter that reacts to Follow.Link messages with a URL as the SourceSelection by outputting a Dispatch.Link message with the DestDocument set to the same URL.
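The first of these filters might be sketched as follows. This is a minimal illustration only: the dict-shaped message, the `fetch` callback and the cache-naming scheme are all assumptions, and the HTML-to-RTF translation step is omitted.

```python
# Sketch of a filter intercepting "Dispatch Link" messages: local
# documents pass through unchanged; URL-addressed documents are
# fetched into a local cache and the message is re-pointed at the
# cached copy.

import hashlib
import os

CACHE_DIR = "/tmp/mcm-cache"          # hypothetical cache directory

def cache_path(url):
    """Map a URL to a stable local file name inside the cache."""
    return os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest())

def dispatch_filter(message, fetch):
    """Send local documents on unchanged; fetch remote ones first."""
    doc = message["DestDocument"]
    if not doc.startswith(("http:", "ftp:", "gopher:")):
        return message                          # local file: unchanged
    local = cache_path(doc)
    if not os.path.exists(local):               # not yet downloaded
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(local, "w") as f:
            f.write(fetch(doc))                 # request from network server
    return {**message, "DestDocument": local}   # dispatch the local copy

msg = {"Action": "Dispatch.Link",
       "DestDocument": "http://host.site/doc.html"}
out = dispatch_filter(msg, fetch=lambda url: "<title>demo</title>")
print(out["DestDocument"].startswith(CACHE_DIR))   # True
```

The other two filters are simpler inverses of this mapping: translating a cached file name back into its original URL when a link is made, and echoing a URL selection back as a dispatch request.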

These modifications allow Microcosm to display HTML and other WWW files, follow WWW-type links in WWW files, create and follow Microcosm links to and from WWW files and create and follow WWW links in Microcosm files. These modifications are not an extension to or deviation from the standard version of Microcosm; rather they constitute a different configuration of the standard Microcosm framework to deal with a new information service. (The reason for using Word as a Web viewer, in a similar fashion to the TEI viewer of the Elektra project described above, is that the standard Web viewers do not allow selections to be made, making them difficult to use with Microcosm.) This mixture of facilities has the following effects on the users of both systems:

Microcosm readers normally have access to two kinds of material: task-neutral resources (such as dictionaries or literary anthologies) and task-specific resources (comments, essays, questions). Now navigation is not limited to the local environment, but extends to external non-task-specific resources.

Microcosm authors have the same improvements of navigation as for readers. However, this places a heavier burden on the author who is acting as a teacher or trainer, since it is his or her responsibility to be acquainted with the (constantly growing) set of resources which can be made available to the readers. Microcosm documents (and whole document collections) can now be made globally available, as can the Microcosm linkbases.

WWW readers are freed from the tyranny of the button: in order to access a piece of information on the Web it is necessary either to know its address or to be able to find a document that contains a link which references it. In an environment which has no alternative methods of navigation (e.g. a hierarchical structure) this can cause considerable problems, especially if documents are revised [65]. Although this is a problem even in a localised hypertext environment, it is especially significant in a global, unco-ordinated information system. Using Microcosm's `generic links' the reader should be able to select any relevant text as a link to the required information.

WWW authors have greater freedom in the authoring process: instead of providing explicit buttons for navigation to every relevant piece of material, generic links can be used to provide standard services across a whole domain of information.

The above modifications are currently being made available for Microcosm. The author has so far produced the software to retrieve a WWW document and translate it for viewing under Word for Windows.

A practical problem with this approach has been the conversion between HTML and RTF. Although HTML has been cast in terms of an SGML document type definition, HTML documents are seldom verified with a full SGML parser. Common usage frequently breaks the strict definition of the DTD, which means that most user-authored HTML documents cannot be correctly parsed by an SGML-based process. For this reason the DTD actually used here is more relaxed than the DTD distributed by the WWW development team.
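One way to cope with such loosely conforming documents is a tolerant scanner rather than a strict parser. The sketch below is an illustration of that approach, not the relaxed DTD actually used: it recovers tags and text from `tag soup' that a validating SGML parser would reject.

```python
# A tolerant scanner for non-conforming HTML: it makes no attempt to
# validate against a DTD, simply splitting the input into start tags,
# end tags and text runs, even when elements are left unclosed.

import re

TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")

def tokens(html):
    out = []
    for close, name, text in TOKEN.findall(html):
        if text:
            out.append(("text", text))
        else:
            out.append(("end" if close else "start", name.lower()))
    return out

soup = "<P>Unclosed paragraph<P>Another<EM>nested"
print(tokens(soup))
```

A downstream HTML-to-RTF translator driven by such a token stream can then apply its own recovery rules (e.g. implicitly closing an open paragraph) instead of rejecting the document outright.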

The advantage that WWW brings to Microcosm is access to a global hyperbase, but the advantages that Microcosm bring to WWW are flexible authoring of links and the application of links to an increased range of information media.

4.3.5 WWW Meets Microcosm

The previous section defined how to make Microcosm into a local viewing environment for the World-Wide Web (a true hypermedia replacement for the standard Mosaic viewer). It is possible to provide the advantages of flexible link authoring and following to WWW users who do not have a local Microcosm environment by implementing some of the Microcosm features inside a standard WWW environment. This section describes the author's effort to translate some of the major components of the Microcosm model (elaborated earlier in this chapter) into the WWW environment and so to provide Microcosm services to all users of the Web (also explained in [66]).

The major features of Microcosm are the selection-and-action link-following paradigm, external linkbases and the message-passing framework. WWW itself provides a message-passing framework: messages (in the form of URLs) are sent by a client viewing application via HTTP to a WWW server and a document (in HTML format) is received back. A client which wants to obtain Microcosm link services can therefore express its link request message in URL format and send it via HTTP to a Web server. The server can invoke a process which mimics the action of the Microcosm linkbase filters, and which sends back a list (in HTML format) of the destination documents that were matched by the linkbase. That document would be displayed to the user as an equivalent of Microcosm's Link Dispatch dialogue box, allowing the user to choose from among the available destinations by clicking on the HTML buttons which describe them.

An essential prerequisite for Microcosm generic links is the ability to make arbitrary selections within documents, not just to click on predefined buttons. This is accomplished in the PC environment by turning the wordprocessor Word for Windows into a WWW browser (as explained in the previous section), since the current version of the PC Mosaic viewer does not allow the user to make selections. Mosaic under the UNIX X Window environment does allow the user to make selections, and so a simple application called `Microcosm Lite' has been written which presents the user with a single button labelled `Follow Link'. When this button is pressed the application grabs the current selection (whether from the Mosaic viewer, or any arbitrary window) and turns it into a URL which Mosaic then sends to the linkbase server.


Figure 4.6: Microcosm Lite in use

In this way, whichever application the selection was made in, the link request and destination display are made by Mosaic (figure 4.6 shows a selection and the resultant set of links returned from the linkbase). Because of the current limitations of Microcosm Lite, only the selection and the type of the source document's application are sent to the linkbase, unless the selection came from Mosaic, in which case the source document name (or URL) and source document description (or title) are also available.

The link server and `Follow Link' message are specified by a URL as follows:

http://host.site/htbin/linkbase?userID+srcSel+srcOffset+srcDoc+srcType+srcDec

This message is received by the linkbase program htbin/linkbase on the machine host.site. The program matches the selection against the links it maintains, and responds with an HTML document containing a list of the destination documents. The link server can also accept an extended message format which creates a new link in the link database (link type and destination data are added).
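A sketch of how the linkbase program might unpack such a query string is given below. The field names are copied from the URL above; the decoding of %xx escapes follows the standard CGI convention, and the example values are invented.

```python
# Split the '+'-separated linkbase query into named fields.
# Field names follow the URL format shown in the text.

from urllib.parse import unquote

FIELDS = ("userID", "srcSel", "srcOffset", "srcDoc", "srcType", "srcDec")

def parse_request(query):
    """Return a dict mapping each field name to its decoded value."""
    return dict(zip(FIELDS, (unquote(p) for p in query.split("+"))))

req = parse_request("les+lime-stone+412+doc1.html+HTML+Elektra%20text")
print(req["srcSel"], "/", req["srcDec"])
```

The linkbase program can then match `srcSel` (and optionally `srcDoc` and `srcType`) against its stored link rules.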

The link server should encapsulate as much of the Microcosm behaviour as possible in terms of configurable link databases. Since individual user sessions are not maintained, the link server in fact provides a static four-tier hierarchy of linkbases: the first associated with the specific source file in which the selection was made, the second associated with the collection of resources to which that file belongs (i.e. a linkbase for all the documents in the current directory), the third associated with the site (i.e. a per-linkserver linkbase) and the last a private linkbase associated with the user who sent the request. The link server (a simple UNIX shell script in the current implementation) searches each of the linkbases for relevant links in turn. In order to provide these Microcosm link services, a Web server must have the linkbase shell script available; in order for a user to make use of them, he or she must install the simple `Microcosm Lite' application.
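The four-tier search order might be sketched as follows. Here each linkbase is modelled as a dict from selection text to a list of destinations; the real implementation is a UNIX shell script, so this is an illustration of the lookup order only.

```python
# Search the four linkbase tiers in order (file, collection, site,
# user), accumulating every matching destination.

def follow_link(selection, file_lb, collection_lb, site_lb, user_lb):
    """Return all destinations for a selection, in tier order."""
    hits = []
    for lb in (file_lb, collection_lb, site_lb, user_lb):
        hits.extend(lb.get(selection, []))
    return hits

site = {"lime-stone": ["http://host.site/geology.html"]}
user = {"lime-stone": ["notes/robin.html"]}
# Site links are offered before the user's private links:
print(follow_link("lime-stone", {}, {}, site, user))
```

The accumulated list corresponds to the set of buttons presented to the reader in the HTML equivalent of the Link Dispatch dialogue box.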

This chapter has analysed a number of approaches to wide-area information access. WAIS provides a text-retrieval method of navigation to large, remote information resources. WWW provides a simple hypertext method of navigation around a distributed set of document clusters. Microcosm provides a local method of information access which is easy to scale to larger, distributed environments and which provides a degree of robustness in a dynamic corpus of documents, while retaining the advantages of authored links over purely statistical measures of document similarity. By applying Microcosm methods and authoring practices to the WWW environment it is hoped that the Web will become more robust, and that generally useful resources will become easier to author.

5. A NEW DOCUMENT ARCHITECTURE

In previous chapters we have demonstrated the structured nature of text, and examined the impact and use of structure within various hypertext environments. Moulthrop's description of hypertext as being made up of grains of local coherence [104] and Stott & Furuta's classification of hypertexts into hyperbases and hyperdocuments [135], according to the degree of authored intent inherent in the network, pose an interesting challenge in a semi-co-operative global environment such as the Internet: is it possible to expand the coherence found in a single document (lexia) or in a supervised hypertext (created by an individual or planned collaborative effort) into some form of global coherence? Or must a hypertext necessarily fragment into a hyperbase beyond a certain scale?

LACE used a structured document architecture as a mechanism for expressing local coherence and a means of representing complex lexias which, although individually hyperdocuments, were collectively stored as a hyperbase. LACE-92 extended LACE by providing a mechanism for creating these locally coherent, structured lexias from information gleaned from a global hyperbase. In this chapter we look at LACE-93, a method of expanding coherence beyond the confines of a single local document, allowing authored intent to be applied with a global scope and so providing hyperdocument facilities.

5.1 Global Hypertext and the Need for Coherence

In section 4.2 we introduced the World-Wide Web as a global hypertext, and demonstrated its division into content-bearing documents (typically not making much use of links) and catalogue documents, not containing much content, but containing a large number of links. In effect, the coherence exhibited in the global hypertext is either at a very local level (within individual documents or in closely clustered groups of documents) or artificially superimposed as an organisational convenience (navigational shortcuts from one site to another).

One feature of the Web, or presumably of any global hypertext, is the mixture of co-operative and autonomous components of its use. The organisation of each Web site is independent of any of the others, but the details of this organisation, and summaries of the available data, are shared co-operatively with other organisations, and frequently published by `key' sites to the benefit of everyone. In contrast, the authoring of the documents at each site is typically performed in isolation, and without reference to the documents available elsewhere on the network. A co-operative effort may be involved within a site to make its collection of documents coherent, but this breaks down at the larger scale and is not exhibited between sites. In other words, according to Stott and Furuta, at a certain scale the Web ceases to be a hyperdocument and becomes a hyperbase (with no authored intent and no coherence).

The point at which this transformation occurs is the point at which the Web becomes difficult for a reader to use. According to [135], a key feature of a hyperbase is the need to supplement link following with data querying as an information discovery strategy. However data query implies the ability to rapidly enumerate the nodes of the hypertext, a facility not present in the current implementation of the Web. Since information is highly distributed throughout the many sites that compose the Web, any topic-based task becomes all but impossible if the reader is not acquainted with a set of likely sites to start investigating. (The synthesis of Microcosm link facilities into the Web, explained in section 4.3.5, is a useful bridge between the hyperdocument and hyperbase states in that it provides authored links that act like content keyword queries.)

The exact scale at which the transformation between hyperdocument and hyperbase occurs is not fixed: it is certainly possible to store a collection of mainly unrelated articles as a single resource (or even a single document): this is a hyperbase at a very localised scale. Conversely, it should be possible to author a document which draws together information in resources across the global network, from widely diverse sites: this is a hyperdocument at the global scale. Let us refer to the latter as a coherence document, since it is a document which expresses an authored coherence between the contents of many separate resources, and it adds coherence to the hypertext network in which it features.

In figure 5.1, the first network is a classic `well-connected' hypertext, with each node `near to' any other node. This kind of network is frequently seen as the result of a planned authorship activity. A `coherence document' simply provides an alternative viewpoint of the network information to the view seen from any other node. The second network is partitioned into disjoint subnets, and the coherence document provides an original and genuinely summative and cohesive view of the network contents. So a coherence document can provide a useful function in a network which does not already exhibit a high degree of coherence.

                                           
The coherence document provides yet another private coherent viewpoint on an already well-connected network. The coherence document provides a unique private coherent viewpoint on a partitioned network. Since the underlying network is not already well-connected, it acts to `glue' together the information there, and hence as a form of global coherence.
Figure 5.1: Coherence Documents

Catalogue nodes already provide simple lists of other places to go for organisational navigation; the coherence that the coherence document supplies should be in the subject domain: collating, comparing and contrasting the contents of other documents. It is post-hoc, added to the network as an afterthought (literally) and it must increase the structure inherent in the network without constraining the network.

In the above diagrams the coherence document is shown as being `different from' the network and without any links going to it, but the document must be placed somewhere within the document network and assigned a URL, otherwise it could never be accessed. In this sense it does not provide an organising view imposed on the network from above; rather it is a participant in the network and subject to the rules of the network. As a consequence, the coherence document is just `another document', but one expressed as the result of a particular authoring strategy. To make this document accessible to all interested parties it would be necessary to publicise its existence; since it is concerned with a single subject domain it may be necessary to provide a single, well-known catalogue URL which links to all the coherence documents.

Compare this with the authoring strategy of Theseus, in which the elements of the hyperbase are strictly isolated (no inter-component links) and the subject documents (coherence documents) are imposed onto and separate from the hyperbase.

5.2 Authoring Requirements for Coherent Hyperdocuments

Given that coherence is the feature that distinguishes a hyperdocument from a hyperbase, how can coherence best be expressed on a global scale? There are two extreme document types that are often seen in practice: one is a simple catalogue of links in which any text is incidental, the other is a standalone document in which the links are incidental (perhaps implementing citations). The former, frequently seen as a navigational aid on the Web, provides a view on the network without explicit authorial coherence, whereas the latter, similar in type to prevalent technical documentation, has a local coherence but lacks an explicit view on the network.

A different type of document needs to be adopted to express global coherence: one which both provides a view on the network by making promiscuous reference to other material, and `grounds' these references within a coherent framework. This requires more than passing acknowledgement of related texts: the reference must be expounded and its context explained. This kind of model can be seen in older forms of technical literature, as documented in [110]. Here a wider variety of intertextual mechanisms are used than in current technical literature--as well as citations, the use of titles, letters, and personal and historical narratives is seen to signal intertextual content. The purpose of these enhanced referential features is to indicate the relevance and importance of the new work in an environment where a scientific text is seen as intelligible only in the context of an existing body of texts.

So a coherence document may be seen as a weaving of internal and external ideas, local and transcluded paragraphs, according to a particular rhetorical form. Moulthrop comments that "discourse on hypertext could be conceived not as a series of discrete presentations but as contributions to an ongoing conversation" ([104]), so instead of a rhetoric of technical or scientific writing which compels authors to express their own thoughts, ideas and conclusions in isolation with brief reference to the work of others, we can postulate a rhetoric which encourages the inclusion of other writings with copious comment and annotation. Such inclusion leads the reader back to the original work, allowing them to see the same information in a new context. Here we also see at work Landow's claim that in an electronic book "the boundaries of the text become permeable" ([87]).

The WWW project is the only example of a large-scale co-operative hypertext environment that currently exists, and we have seen that common use of the Web makes it incoherent and highly partitioned. Yet to some extent this is not due to the fundamental features of the Web architecture: its documents can reference arbitrary resources, and its document model allows an author to express structured arguments. However, the prevailing rhetorical model, which encourages brief references to external works, is reinforced by the Web's native document structure, which provides a link anchor to act as a button for the reader to activate a new document. The anchor is typically intended to be only a few words long, since it is highlighted to stand out from the main text. Anchors may be annotated to record the relationship between the link's source and destination, but this feature is not used in practice. Transclusions of external material are even provided, but only as a mechanism for embedding graphical material for viewing.

5.3 New Models for New Documents

If extending coherence beyond a local scale requires a particular authoring strategy (or rhetoric), that strategy requires a particular document architecture to represent it. For all its multimedia pretensions, the HTML document architecture is very simple, and conforms to the standard model of a document as a file. This is an inheritance from the early days of computing: a document was an ordered set of punched cards, then a set of 80-column lines in a file on magnetic disk, then a stream of text delimited as lines or paragraphs. These simple .txt files are probably still the most widely used document (non-)format to date, but the file model by its very nature partitions information into here and not-here units (included and excluded information), complicating any sharing of information and turning the concept of shared access to relevant information (hypertext) into something exotic.

By contrast, we have seen in section 2.2.3 that documents can be composed of objects, and that each object may have various attributes to define its intended use. This leads to a new document model: that of a document as a view on a collection of objects. The objects may be contained in a single file, spread across several files, or even shared between several host computers. The view defines how the document is treated--how it is to be collated and composed from its set of component objects.
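This document-as-view model can be sketched very simply: a pool of attributed objects plus a view function that selects and orders some of them. All object types, identifiers and contents below are purely illustrative.

```python
# A document as a view on a collection of objects: the same pool of
# attributed objects yields different documents under different views.

objects = {
    "o1": {"type": "heading",   "text": "Quarterly Report"},
    "o2": {"type": "paragraph", "text": "Sales rose."},
    "o3": {"type": "revision",  "text": "Draft 2, 1994-11-01"},
    "o4": {"type": "paragraph", "text": "Costs fell."},
}

def view(pool, wanted_types, order):
    """A document is the sequence of objects that a view selects."""
    return [pool[i]["text"] for i in order if pool[i]["type"] in wanted_types]

# The reader's document omits the revision history; an editor's
# view over the same pool could include it.
print(view(objects, {"heading", "paragraph"}, ["o1", "o2", "o4"]))
print(view(objects, {"revision"}, ["o3"]))
```

Note that neither `document' exists as a file: each is simply the result of applying a view to the shared object pool.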

At first sight this new model seems overly complicated, but consider as an example a business report marked up in SGML: it may consist of list-of-contents elements, index elements, glossary elements, revision history elements, security elements, chapter elements, heading elements, paragraph elements and text data. An SGML DTD may define the parse structure of this file, but what does the document actually consist of? What is the current information contained in it? What information is viewable given the security clearance of the current session? In short, faced with a collection of objects with a complex set of relationships between them, where does one start in order to simply elaborate the meaning of the document? The answer to this is that the application which processes the document is responsible for untangling the network of objects and recognising their interdependent semantics. In the SGML world objects have attributes which identify them and the use to which they are put. An element's tag name can be seen as a particular case of an attribute which is also used in parsing the document according to an external grammar. This grammar may not reflect the meaning of the document in any normal sense--it may only indicate the way that a document can be expressed as a linear stream of text. The meaning of the document is derived by an application that understands the document, and may involve extracting particular objects from the document based on their attributes. Making sense of a typical document may involve the following two simple steps:

a Find the highest priority content object (e.g. out of the possible sections, subsections, chapters, parts, books, or volume structures).

b Elaborate the inline text content of this object and recursively any sub-objects that it contains. Also interpret the relationships between this object and any other objects (footnotes, glossaries, index entries, tables of contents, cross-references, marginalia) and display the related objects or an indication of the relationship if deemed necessary.
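The two steps above can be sketched as a recursive walk over a hypothetical object tree. The field names (`text', `children', `related') are assumptions made for the sketch, not part of any particular document standard.

```python
# Step b as a recursive walk: render inline content depth-first, and
# signal (rather than expand) related objects such as footnotes.

def elaborate(obj, out):
    """Append this object's text, its sub-objects' text, and an
    indication of each relationship to other objects."""
    if obj.get("text"):
        out.append(obj["text"])
    for child in obj.get("children", []):
        elaborate(child, out)
    for rel, target in obj.get("related", []):
        out.append(f"[{rel}: {target}]")   # indication, not expansion
    return out

doc = {
    "priority": 1, "text": "Chapter 1",
    "children": [{"text": "First paragraph.",
                  "related": [("footnote", "fn1")]}],
}

# Step a: start from the highest-priority content object (here, doc).
print(elaborate(doc, []))
```

An application with more meta-information (revision histories, security levels) would insert extra inclusion/exclusion checks before each append.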

This process may become more elaborate according to the meta-information that is stored with the document. For example, revision histories and security information could add extra levels of checking for the inclusion or exclusion of particular objects. Increasingly, the lexical structure of a file or entity is not clearly mapped onto the content structure of a document. (This tendency has been exacerbated by HyTime's hyperlinks, as we shall see: footnotes, index terms, annotations and even whole sections and chapters may exist as independent entities, joined only by an ilink element in an independent part of the document.) In short, the SGML markup presents a container architecture for the document. In such a situation it is the responsibility of the controlling application to understand and make use of the contained contents by means of the relationships between the linked items.

A number of standards are emerging from industry and international bodies which are of particular relevance to document architectures. Many of them also move away from the simple assumption that document ≡ file by defining an object-based architecture which may be implemented as part of a file's contents. Although de jure standards are not usually a popular computing concern, they are of particular importance when dealing with the production of a global information resource. Such a goal (which was one of the original aims of hypertext [108]) requires a colossal amount of investment from the producers of this information, and longevity of the product will be one of the main requirements to protect this investment. Crane [43], arguing that our contributions to this global information resource should remain part of the public record for an indefinite period, maintains that we should immediately move to a common interchange standard in order to be able to fully share information and functionality, and then allow this standard to evolve as the problems of large-scale hypermedia become better understood. Brown [25] expands on this by arguing that any hypertext source format should be text-based and geared towards sharing not just between different hypertext systems, but between other software tools as well. The rest of this section will take a brief look at three commercial document standards (OLE2, OpenDoc and Acrobat), an evolving academic standard (HTML) and two international standards (MHEG and HyperODA). The section finishes with a longer description of the HyTime standard and some examples of its use.

5.3.1 OLE2

Microsoft's OLE2 ([100]) defines a structured storage model which is imposed on the contents of a file. The model is hierarchical in form, using public-format substorages which contain private-format data streams. The substorages can be interpreted by any interested party, and provide a directory of names and attributes for accessing the data streams (which may only be interpreted by the owning application). Each high-level document component (such as a spreadsheet or a picture) is likely to be held in a separate data stream. The individual objects (data streams) are referred to by monikers, using a combination of a file reference and an object reference within that file. Unfortunately, monikers do not yet work across networks and are likely to break when the files they refer to move.

5.3.2 OpenDoc

Apple's OpenDoc [3, 4] architecture defines a similar structured storage model called Bento. Bento defines the role of a container (usually a file, but possibly another entity such as a block of memory or a network message) which holds various groups of objects. Each object combines a unique (persistent) id with a set of properties, each of which consists of a name and a list of (typed) values.
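The Bento object model just described might be sketched as follows. The container is modelled as a simple dict, and all ids, property names and value types are illustrative rather than drawn from the Bento specification.

```python
# Sketch of the Bento model: each object pairs a persistent id with
# properties; each property maps a name to a list of (type, value)
# pairs, so one property can carry the same data in several formats.

container = {}

def add_object(obj_id, **props):
    """Store an object in the container under its persistent id."""
    container[obj_id] = {name: list(values) for name, values in props.items()}

add_object("obj-42",
           title=[("string", "Sales figures")],
           body=[("text/plain", "Q3 results"),
                 ("text/rtf", r"{\rtf1 Q3 results}")])

print(sorted(container["obj-42"]))        # the object's property names
print(container["obj-42"]["body"][0][0])  # type of the first body value
```

The list-of-typed-values structure is what lets a co-operating software component pick whichever representation of a property it can handle.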

Both OpenDoc and OLE2 provide a mechanism for simple embedded objects, a concept often demonstrated by having a spreadsheet included inside a word-processor document, and both of these technologies emphasise the component nature of documents and applications. Although it currently looks as if a `word-processor' document merely permits objects from foreign applications to exist inside its boundaries, these models anticipate a set of co-operating data objects, operated on by a set of co-operating software components. In such an environment there are no native word-processor documents, spreadsheets or databases: instead there is a collection of data objects which can be operated on by various software components inside a unifying data structure.

5.3.3 Acrobat

Acrobat is commercial software from Adobe for encoding documents in a platform-independent fashion [21]. It defines a multimedia document encoding called PDF (Portable Document Format) in which a document is composed of a set of objects (page, text, graphics, images and links) with fixed relationships between them. Structure, annotation and page preview objects are also supported, and video and audio objects are anticipated in a forthcoming release. All objects are coded in PDF's 7-bit text representation, although they may contain data streams which can be decoded and decompressed into various standard media types (e.g. JPEG). The objects are elaborated in the file contents between obj/endobj markers, and an optional object index pointing into the file contents is located at the end of the file in order to speed random access to individual objects. To elaborate a PDF file, an application locates the root object, unpacks a reference to the objects which represent the page catalog, and for each of the page objects listed in the catalog renders the text and graphics objects which compose that page. PDF is a commercially successful, contemporary example of a novel object-based document architecture, but it is targeted towards one specific application. Its objects are pre-formatted for presentation onto pages of a specific size and no `abstract' information is held with any of the objects--it is a non-trivial task simply to extract the text from an object. Even a hypertext link object has its source and destination anchors defined in terms of a rectangle in the page co-ordinate space.
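The elaboration process just described can be sketched as follows; the object dictionary is a toy stand-in for a parsed PDF file, and its keys and field names are invented rather than PDF's actual syntax:

```python
# Illustrative sketch of elaborating a PDF-like file: start at the root
# object, follow it to the page catalog, then render the objects
# composing each page. All object names here are invented.

objects = {
    "root":  {"type": "catalog-ref", "catalog": "pages"},
    "pages": {"type": "catalog", "kids": ["page1"]},
    "page1": {"type": "page", "contents": ["text1", "img1"]},
    "text1": {"type": "text", "data": "Hello"},
    "img1":  {"type": "image", "data": b"..."},
}

def elaborate(objects):
    """Walk from the root to the catalog and list what each page renders."""
    rendered = []
    catalog = objects[objects["root"]["catalog"]]
    for page_id in catalog["kids"]:
        for obj_id in objects[page_id]["contents"]:
            rendered.append((page_id, objects[obj_id]["type"]))
    return rendered
```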

OpenDoc, OLE2 and PDF make use of an object-centered architecture, but still associate the document's contents with a single file. PDF's objects have publicly-defined contents which must be read in order to understand the composition relationships that build the document. By contrast, OLE2 defines a well-understood hierarchy of containment which can be used without understanding the format of the component data streams. OpenDoc devolves the responsibility of composing the document to the controlling application, providing neither a publicly-understood containment mechanism, nor publicly-understood object formats.

5.3.4 MHEG

MHEG is a forthcoming international standard for interchanging hypermedia objects [115]. It is a container architecture which allows media objects to be represented according to an appropriate (external) standard along with instructions for their presentation and behaviour. MHEG objects are to be encoded according to ASN.1 or SGML. It is intended as a practical interchange standard for `industrial strength' hypermedia applications requiring real-time interchange and addresses the problem of exchanging multimedia objects for presentation. In this way it is very different from HyTime: display and control semantics are a part of the MHEG standard, whilst in HyTime these are devolved to the controlling application.

5.3.5 ODA and HyperODA

HyperODA is a set of proposed extensions to the Open Document Architecture standard (ISO 8613). ODA is a container architecture which represents text and graphics (each expressed according to an appropriate external standard), providing logical (abstract) and layout (physical) views of the document in parallel. HyperODA extends this model with extra content architectures for audio and images, as well as link objects and temporal layout. HyperODA is similar to MHEG in that it associates presentation semantics with the document objects, but is more prescriptive than MHEG since it constrains the representation of the component multimedia objects to a small number of international standards. This has the advantage that two HyperODA-compliant applications can always completely understand any document that they exchange, but has the disadvantage that new kinds of media object (movies, for example) cannot be represented without a new version of the standard being defined. HyperODA's similarity to HyTime comes from its association of logical structure with the document components.

5.3.6 HTML

HTML (HyperText Markup Language) is an SGML-based document architecture used by the academic `World-Wide Web' (WWW) project. It is designed for simply structured textual documents with embedded graphics and hypertext links, and uses a very simple document structure consisting of headings, paragraphs, lists and various inline text styles. HTML's links are similar to HyTime's contextual links and can be expressed in a HyTime-compliant fashion [50]. HTML defines a single document architecture with general rendering semantics.

5.3.7 HyTime

Many of the above standards cater only for a fixed, local set of component document objects and most (such as OLE2 and Acrobat) project a fixed view upon them. In a distributed information environment (such as the Web) it is necessary to consider the possibility of using material from an enormous collection of resources. In such a case it is likely that the objects being used are not all stored in the same file. HyTime [36, 37, 50, 75] is a recent international standard for encoding document structures and expressing the relationships between them that addresses this issue. Built on top of SGML [76] it extends the addressing model to allow access to objects outside the scope of the immediate unit of data storage. It also allows subparts of objects to be treated as objects, provides a mechanism for specifying objects by a query based on an abstract property associated with the object, and provides the ability to define objects along an abstract co-ordinate system (e.g. timelines or pixel grids). Unlike OLE2, Bento and PDF, HyTime allows an object to be defined as an arbitrary view on an underlying entity, or even as an aggregation of such views. Instead of having a rigid container architecture, HyTime allows the author to make arbitrary decisions about what constitutes an object for a particular purpose.

HyTime presents a significant step away from the notion of a document as a file by building on SGML's concept of a document as a group of entities. A HyTime document has an explicit hub which is the central defining element of the document's contents and from which the contained objects are linked via a well-defined set of relationships. The aim of HyTime is to preserve information about the scheduling and interconnection of related components of a hypermedia document (e.g. audio, music score and libretto in a CDROM version of an opera) that would otherwise be embedded inside application-specific `scripts'.

When SGML was proposed as a standard it was becoming more commonplace for authors to exchange individual documents electronically and the requirement was for a common medium for expressing these documents. In recent years the development of international networks (such as the Internet) has enabled sharing on a wider scale, with repositories of documents and multimedia information being set up across continents. One of the important needs is to be able to tie these information resources together, linking to or citing other works published on a remote server. Many common applications do now provide hypertext facilities, enabling the linking of information. However most of them do this as a product of an internal scripting language: the links are hidden and exist as a consequence of the execution of a program rather than being explicitly declared data objects, making it difficult to exchange the data between applications.

HyTime markup can express important information about documents: about their structure and the way they should be presented. This information is added value--it allows a document to be reused and interchanged between systems for many purposes and as such is an economic consideration. The benefits of generalised markup (as exemplified by SGML) for representing document structure are increasingly appreciated, especially in commercial and military organisations which have to deal with large volumes of information. Projects such as the Oxford English Dictionary [119] illustrate the benefits of this approach both for the production of different versions of the dictionary in printed form and for the production of a CDROM-based version with advanced searching capabilities.

HyTime is a methodology for describing document features and the relationships between different parts of documents, but it does not prescribe the meaning of these features or relationships. It uses terms like `hyperlinking' without defining what happens when a link is followed, or even how a link is activated. HyTime is not a system that can be executed to display multimedia documents and jump between document objects using hyperlinks. A HyTime-aware application would be needed to interpret a HyTime-compliant document and render it for display. It is in fact anticipated that the main use of HyTime will be for encoding documents for interchange between various proprietary systems, and although the HyTime standard provides various facilities to speed up native rendering of a HyTime document, HyTime is not necessarily the most suitable format for coding multimedia material.

Although HyTime is used to mark up hypermedia documents in conjunction with SGML it is not a single document architecture (i.e. it is not a DTD). Early versions of the draft standard did in fact define a HyTime DTD, but this was abandoned as being too restrictive. Instead HyTime is often referred to as a meta-DTD since it provides a set of standard components (or `architectural forms') which can be used to construct document architectures. As such, HyTime defines a (very large) family of document architectures, and rules for constructing their DTDs.

HyTime is both abstract and specific: it provides abstractions of facilities that are useful in building document architectures, but is very specific about how these abstract facilities must be coded. HyTime constructs are expressed as combinations of SGML elements and attributes which have to be interpreted by a HyTime engine subsequent to their parsing as SGML elements. Although such a HyTime engine may appear to play the role of a post-processor for SGML files, a more co-operative role is needed, since the SGML parser may be required to provide access to any objects in external entities which the HyTime engine needs to interpret. In fact, both HyTime and SGML processing engines are likely to be components of a larger document handling environment.

HyTime is primarily concerned with documenting the relationships between different parts of documents. SGML already has facilities for making references between elements of a document: elements may be labelled with an id attribute and then referred to by that label in another element's idref attribute. This facility can be used to implement cross-references, hypertext jumps, object class systems, style sheets or many other constructs; however it is quite restrictive for a number of reasons. Firstly, only whole elements may be addressed, and so document objects are rigidly defined with quite a coarse granularity--it is not possible to quote a reference to a relevant fragment of a paragraph. Secondly, every element which is to be addressed must be explicitly labelled (conversely, only elements which the author has bothered to label may be addressed). This is not a worrying restriction to the originator of a document, who is free to make whatever labelling additions may be desired, but an author who is trying to `link in' to an existing work (a standard reference resource such as a dictionary, or a seminal academic paper) may have great problems expressing an arbitrary link using just an idref. The third problem with SGML idrefs is that they may only refer to labels within the same document. This makes linking to external reference works impossible without including them in their entirety through an entity reference.

Thus in order to allow flexible linking of documents, one of HyTime's major functions is to extend SGML's object addressing model. Object addresses may be constructed from a combination of sub-addressing techniques, starting from a well-known object, such as an SGML named external entity or a previously labelled SGML element (or HyTime object). From such a starting place it is possible to repeatedly narrow down the address by taking a linear offset from one of the ends of the object, or by specifying a hierarchical position within a tree-structured object. Object addresses (or a part of an object's address) may also be specified as the result of a query on the various properties of the document (its structure or data content). This flexible addressing mechanism may be used, for example, to allow a literature student to refer to a specific word or phrase buried inside a paragraph of a read-only document that is not even marked up in SGML.
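The narrowing process can be sketched as follows; the document structure and the two step functions are invented illustrations of the idea, not HyTime syntax:

```python
# Illustrative sketch of HyTime-style address narrowing: starting from a
# well-known object, each rung of a `location ladder' selects a smaller
# span, here first by tree position and then by linear offset.

document = {
    "chapter": [
        {"para": "First paragraph of the chapter."},
        {"para": "To be, or not to be: that is the question."},
    ]
}

def tree_step(doc, index):
    """Select the index'th child element (1-based position)."""
    return doc["chapter"][index - 1]["para"]

def offset_step(text, start, extent):
    """Select `extent' characters after skipping `start' characters."""
    return text[start:start + extent]

# Ladder: the second paragraph, then a phrase buried inside it.
phrase = offset_step(tree_step(document, 2), 14, 5)
```

The read-only source is never modified; only the address narrows, which is what allows a student to cite a phrase inside a document they cannot label.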

HyTime also provides facilities for describing hypertext links, giving a standard representation for marked-up links. Links can be contextual (i.e. embedded in the document at one of the anchor points) or independent (occurring at a position in the document which is unrelated to any of the objects that it links). A link can have multiple end points, with a role assigned to each endpoint and rules controlling traversal between the endpoints.

HyTime is a modular standard, with the document designer free to choose only those facilities which will be needed. The base module is always required and provides facilities using SGML constructs for object representation and addressing, as well as miscellaneous facilities for other HyTime modules. The measurement module defines the concept of addressing document objects according to a measurement along some abstract dimension (for example words 3 to 27 could be a measurement within a paragraph). Various standard units are defined for familiar temporal and spatial measurements. The location address module allows reference to be made to document objects which cannot be addressed with the normal SGML facilities of the base module: these objects can be referenced by name, position or query. A location ladder can be built up of gradually more and more specific location addresses (e.g. the draft chapter's fourth heading's third word's second letter). The hyperlinks module provides methods for representing link objects (based on the various object addressing and representation methods provided above) and the semantics associated with traversing the link. The scheduling module provides events which are objects positioned within a multi-dimensional co-ordinate space. The rendition module provides ways for describing the modifications that can be made to an object within an event and the ways that events can be projected from one co-ordinate system into another.

Using HyTime in Text Processing

Text processing is mostly concerned with the production of printed documents and requires information to be moved (to produce footnotes), copied (to produce tables of contents) and collated (to produce indexes). HyTime is useful in this situation because it can describe the relationship between individual text items and the larger document structure.

For example, in order to produce an index a list of terms has to be decided on, and then all the relevant occurrences of each term (or its synonyms) must be referenced in the text. Each index entry catalogues the relationship between its term and a number of occurrences in the document and can thus be modelled by a hyperlink (despite its name a hyperlink does not necessarily have anything to do with hypertext, only the connection of two document objects). A hyperlink encodes a connection between several document objects called `anchors' of the link, and assigns a `role' to each of the anchors. For an entry in an index there could be two anchors--the term to be indexed and the set of its occurrences within the document.

Figure 5.2a shows such an index entry. It connects a `term' element to an `occurrences' element whose instantiations have ids `t1' and `o1' respectively. Both elements are declared to occur inside the indexentry element. Indexentry is not part of HyTime, it is simply defined in the DTD (as shown in Figure 5.2b) with HyTime standard attributes. It is the HyTime attribute that identifies the indexentry as being an example of an independent link (ilink) to the HyTime engine. The HyTime engine can then resolve the value of the linkends attribute to find the various anchors for the text processing application to use. (More likely the value of the anchrole attribute would be fixed in the DTD and so not given in the document instance itself.)

In text processing environments, index terms are frequently given special markup in the body of the text. If this is the case, HyTime may locate the term's use by referring to the markup's id. If this is not the case, or the indexer does not have write access to the document's text, then HyTime may locate the index entries by using a dataloc (data location) element. A dataloc element identifies an anonymous span of data within another named object (called the location source, or locsrc, perhaps an element with an id or a named entity) by giving an offset from one end of that object and an extent. For example, if this section (entitled `Text Processing') had been marked up with an id of textp, the following examples of a dataloc element could address the word `production', either by counting characters or words from the start of the section. (The dimlist element treats its numbers as a measurement along an abstract dimension, in this case the data content of a section element.)

<dataloc locsrc=textp quantum=str><dimlist>45 10</></dataloc>

or <dataloc locsrc=textp quantum=word><dimlist>8 1</></dataloc>
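A minimal sketch of how an engine might resolve these two datalocs follows; the quantum-counting conventions used here (characters preceding the selection for quantum=str, a 1-based word position for quantum=word) are chosen so that both examples select the word `production', and are an interpretation for illustration rather than a statement of the standard:

```python
# Illustrative sketch: resolving the two dataloc examples above against
# the data content of the section with id `textp'.

textp = ("Text processing is mostly concerned with the production "
         "of printed documents ...")

def dataloc_str(data, offset, extent):
    """quantum=str: `extent' characters after the first `offset' chars."""
    return data[offset:offset + extent]

def dataloc_word(data, offset, extent):
    """quantum=word: `extent' words starting at word `offset' (1-based)."""
    words = data.split()
    return " ".join(words[offset - 1:offset - 1 + extent])
```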

<indexentry anchrole="term occurrences" linkends="t1 o1">

<term id=t1>multimedia

<occurrences id=o1>oc1 oc2 oc3</>

</indexentry>

Figure 5.2a: An entry in an index

<!ELEMENT indexentry - - (term, occurrences?)>

<!ATTLIST indexentry HyTime NAME #FIXED ilink

anchrole NAMES #REQUIRED

linkends IDS #REQUIRED>

Figure 5.2b: Defining an IndexEntry construct in the DTD

<!ATTLIST occurrences HyTime NAME #FIXED nmlist

nametype NAME #FIXED element>

Figure 5.2c: Defining an Occurrences construct in the DTD

Since each term appears numerous times within the document the `occurrences' anchor is a HyTime multloc or multiple location, which consists of a list of ids, each resolving eventually (perhaps indirectly through a dataloc) to a word in the document.

By use of the HyTime-based indexentry document structure given above, we have enabled the document designer to express connections between a specific document object (here a piece of text) and numerous places in the document. This allows the index to refer not just to occurrences of a particular word, but to whole paragraphs of text, or pictures and diagrams. It is the responsibility of the index creator to decide how to represent each of these connections.

Using HyTime for Presentations

Often there is a requirement to make a presentation based on the information drawn from a collection of books. This presentation not only imposes a temporal ordering on the information but also allocates a time-span to particular pieces of information, based not on the length of the content, but on its perceived importance. An educational course syllabus is an example of such a presentation.

HyTime can be used to represent such a course syllabus by using a finite co-ordinate system (fcs) to represent a timeline, and then mapping each component of the course onto the appropriate position on that timeline.

<!ELEMENT semester - - (courseschedule)+ >

<!ATTLIST semester HyTime NAME #FIXED fcs

axisdefs NAME #FIXED timeaxis>

<!ELEMENT courseschedule - - (lecture)+ >

<!ATTLIST courseschedule
HyTime NAME #FIXED evsched>

<!ELEMENT lecture - - (content)+ >

<!ATTLIST lecture HyTime NAME #FIXED event

exspec IDREFS #REQUIRED>

<!ELEMENT duration - O (#PCDATA)

-- LexModel(snzi, s+, snzi) -->

<!ATTLIST duration HyTime NAME #FIXED extlist

id ID #REQUIRED>

<!ELEMENT content - O (#PCDATA)>

<!ATTLIST content HyTime NAME #FIXED nmlist>

Figure 5.3a: Defining a Timeline in a DTD

<semester><courseschedule>

<lecture exspec=single>

<content>chap1</>

<lecture exspec=dbl>

<content>chap3 chap4 sect6</>

<lecture exspec=single2>

<content>chap2</>

</courseschedule></semester>

<duration id=single>26 1</>

<duration id=dbl>37 2</>

<duration id=single2>78 1</>

Figure 5.3b: Using a Timeline

Figures 5.3a and 5.3b show the definition and use of such a timeline. In figure 5.3b we see that a semester contains a course schedule which contains a number of lectures, each of which contains a set of contents and refers to a duration for the lecture. The contents themselves are references to the contents of a text book, perhaps indirectly through a dataloc. Figure 5.3a shows how this is defined using HyTime's constructs. The semester is an example of a finite co-ordinate system whose axes are defined by a timeaxis structure (not shown here). In fact there is just one axis here (the time axis) which would be measured in `teaching blocks' for convenience. The courseschedules which it contains are examples of HyTime's event schedules. Each schedule contains many events (lectures in this example) which tie a document object (the content elements) to a position and extent in the co-ordinate system (place the content along the time axis).

The purpose of the duration elements (HyTime extlist) is to specify the start and extent of the event in the units of the co-ordinate system. This example uses particularly opaque measurements, so to make it more useful to a human it would be better to project the events in this co-ordinate system onto a natural calendar by using the event projector facility of the rendition module.
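The placement of these events along the time axis can be sketched as follows; the data reproduces the durations of figure 5.3b, but the resolution logic is an illustration, not HyTime's event-schedule semantics:

```python
# Illustrative sketch: events on a one-axis finite co-ordinate system.
# Each duration gives (start, extent) in `teaching blocks', as in the
# durations single, dbl and single2 above.

durations = {"single": (26, 1), "dbl": (37, 2), "single2": (78, 1)}
lectures = [("chap1", "single"),
            ("chap3 chap4 sect6", "dbl"),
            ("chap2", "single2")]

def schedule(lectures, durations):
    """Place each lecture's content along the time axis."""
    placed = []
    for content, exspec in lectures:
        start, extent = durations[exspec]
        placed.append((start, start + extent, content))
    return sorted(placed)
```

Projecting these opaque (start, end) blocks onto a natural calendar would be the job of the rendition module's event projector.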

Using HyTime for Hypertext Interchange

Exchanging documents between word-processors is a common problem for which one solution lies in the manufacturers of each program making translators available for importing documents native to all the other (commercially successful) programs. An alternative approach is to define a common `document interchange language' (such as Microsoft's Rich Text Format) for which each program only needs to provide an `export' and `import' facility. A similar problem exists for exchanging hypertext documents between hypertext systems.

Microcosm [5] is an open hypermedia system developed at the University of Southampton. One of its chief features is that no information concerning links is held in documents; instead all link information is held in external linkbases which contain the required details about the source and destination anchors of the links. It comprises independent components (document viewers and link managers) which communicate by passing messages. Working in such an open environment means that the system response may be sub-optimal and so hypertexts developed in Microcosm may be translated to a cut-down but optimised delivery environment (such as Microsoft Help). One of the major problems inherent in such a translation is that the linking facilities of the two systems may not directly map onto each other. The rich nature of HyTime's linking capabilities makes it possible to translate hypertext semantics into a HyTime representation without loss of information and it is therefore useful to use HyTime to form an intermediate representation (a kind of `Rich Hypertext Format') as a midway stage in mapping between two hypertext systems. The translation process then divides into a sub-process that converts a native Microcosm dataset into a HyTime-based representation, and a further translation process to convert (possibly a subset of) this HyTime representation into another hypermedia format [6].

\DocID history.intro \Offset 246 \Selection Mihailovich

Figure 5.4a: Microcosm Address Tuple

<nameloc id=histDoc>

<namelist nametype="entity">history.intro</></>

<dataloc id=mihail quantum=str locsrc=histDoc>
<dimspec>246 11</dimspec></>

Figure 5.4b: Address Tuple as a HyTime Location Ladder

The most common Microcosm addressing mechanism is the (document id, offset, extent) tuple. The Microcosm address specification tuple in figure 5.4a references a string of (implicit length) eleven characters starting at character offset 246 of a document whose id is history.intro. It could be expressed as the two-stage HyTime `location ladder' in figure 5.4b, in which the first (nameloc) element associates an SGML id histDoc with the document, and the second (dataloc) element locates the string within the identified document. Any reference to the name mihail will now resolve to the requested object.
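The translation from a Microcosm address tuple to this location ladder can be sketched as follows; the conversion function is hypothetical, though the markup it generates follows figure 5.4b:

```python
# Illustrative sketch: translating a Microcosm (document, offset,
# selection) tuple into a two-rung HyTime location ladder. The extent is
# implicit in Microcosm (the selection's length) but explicit in HyTime.

def tuple_to_ladder(doc_id, offset, selection, name_id, data_id):
    nameloc = ('<nameloc id=%s><namelist nametype="entity">%s</></>' %
               (name_id, doc_id))
    dataloc = ('<dataloc id=%s quantum=str locsrc=%s>'
               '<dimspec>%d %d</dimspec></>' %
               (data_id, name_id, offset, len(selection)))
    return nameloc, dataloc

nameloc, dataloc = tuple_to_ladder("history.intro", 246, "Mihailovich",
                                   "histDoc", "mihail")
```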

HyTime links may have more than two anchors, and the document designer has to provide semantics for each of the anchors. By contrast, Microcosm links have only two anchors (source and destination), but a destination anchor may be composed of many documents' objects (the equivalent of a HyTime multiple location). HyTime links can take two forms--contextual links, whose definitions appear at one of the sites of the link anchors (i.e. in context), and independent links, whose definitions are given at some other place in the hyperdocument. Microcosm links are always of the latter type, since link definitions are stored in separate linkbases, referring to their anchor positions through the addressing mechanisms above.

A Microcosm linkbase can now be modelled as a collection of HyTime independent links:

<mcmlink anchrole="source destination" linkends="srcid dstid"
endterms="linkdisp1 linkdisp2">

where the multiple destination may be specified as a simple list of destinations as follows:

<nameloc id="dstid"><namelist nametype=element>
destid1 destid2 destid3</></>

This example is similar to the index example given previously, except that the information given by the link endterms is intended to specify how the link source and destination are to be portrayed--here the source is formatted as a button and provides a short preview of each component of the multiple destination. This is achieved using elements of the following form:

<displayinfo id="linkdisp1"> <anchorformat>button</></>

<displayinfo id="linkdisp2"> <anchorformat>normaltext</></>

which are referred to by references to their unique identifier (id) within the mcmlink element.

A Microcosm link may completely specify its source anchor (in terms of document, offset and content) in which case it is known as a specific link. But by leaving the offset or document unspecified the content acts as a source anchor for this link anywhere that it appears in any document. This is a generic link which no longer contains explicit connections to a source document location.

HyTime makes provision for locations to be specified as the result of a query performed on the content or structure of a document, defining a standard query notation (HyQ) for this purpose, and it is possible to express the source locations of a generic link with such a query. This can be done by replacing the explicit dimension specification (dimspecs) in figure 5.4b above with an axis marker query which represents a matching operation against the required texts. Any query notation (e.g. regular expression searches) is allowed in this context. For specific links, the source specification srcid resolves (through a dataloc) to a single location. For generic links, srcid resolves to a multiple location through a query which returns a dataloc for each occurrence of a particular piece of text, where the query domain is either a single document (local link) or the entire hyperdocument (generic link).
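The resolution of a generic link's source anchor can be sketched as a query over a set of documents; the function and document contents are invented, and regular expression matching stands in for an axis marker query:

```python
# Illustrative sketch: resolving a generic link's source anchor by query.
# Every occurrence of the anchor text yields a (document, offset, extent)
# span, the equivalent of one dataloc in a multiple location.

import re

def generic_link_sources(anchor_text, documents):
    """documents: mapping of document id -> text; returns dataloc tuples."""
    spans = []
    for doc_id, text in documents.items():
        for m in re.finditer(re.escape(anchor_text), text):
            spans.append((doc_id, m.start(), len(anchor_text)))
    return sorted(spans)

docs = {"history.intro": "Mihailovich led the Chetniks. Mihailovich was...",
        "history.ch2": "...the trial of Mihailovich..."}
hits = generic_link_sources("Mihailovich", docs)
```

Restricting the mapping to a single document models a local link; passing the entire hyperdocument models a fully generic link.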

5.4 Lace '93--A New Document Architecture


Figure 5.5: Lace '93 environment

In this section we present a distributed, structured, object-centered multimedia document architecture (Lace '93) which addresses the issues of document models and authoring rhetoric to provide global hyperdocuments, allowing authors to apply coherence to the lexias which compose an existing hypertext network. It does this in a number of ways:

* providing a document architecture which allows a document to be represented as a collection of components, both local and remote

* providing explicit relationships between the components

* promoting a style of authorship (a rhetoric) which encourages the merging of local and remote components

* defining a viewer which can display the components and their relationships

A Lace '93 document is a particular view on a set of objects. It is implemented as a file containing a set of objects (or object specifications), a set of relationships between the objects, and (possibly) a set of local definitions for the implementation of the relationships. A Lace '93 environment, therefore, consists of an object manager, a relationship manager and a display manager.

5.4.1 Lace '93 Objects

Lace '93 makes an object the fundamental component of a document. These objects are defined in a similar fashion to HyTime: a dynamic view on an underlying data storage medium. The objects are declared as part of the Lace '93 document, and may be either literal objects (i.e. objects whose contents are included in situ) or indirect objects (i.e. references to objects which are stored elsewhere). The kinds of object references currently implemented are:

file reference: the contents of the object are the name of a file on the local host's file system

WWW reference: the contents of the object are the Universal Resource Locator of a document available via the World-Wide Web

ruler reference: based on HyTime's data location facilities, the contents of the object are interpreted as offset measurements from the ends of another object.

The ruler reference is used to define an object as a subpart of another object. The measurements are either of the form

m n the object consists of the n characters starting from the m'th character from the start of another object (these semantics are borrowed from HyTime's dimspec facility)

m -n the object starts at the m'th character and continues to the n'th character from the end of another object (these semantics are borrowed from HyTime's dimspec facility)

/first/ /second/
the object starts at the first occurrence of the character string /first/ and continues to the next occurrence of /second/ in another object. If the first character of the second string is the caret (^) then the selection finishes immediately before the second match. This is a convenience and not based on a HyTime facility.
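A sketch of resolving these three measurement forms against the text of an underlying object follows; 0-based numeric offsets are assumed (counting characters passed over, as in the dataloc example earlier), which is an interpretation for illustration rather than a definition:

```python
# Illustrative sketch of resolving a Lace '93 ruler reference. The spec
# is a pair of strings: two numbers (m n or m -n) or two /delimited/
# search strings, following the three forms listed above.

def ruler(text, spec):
    if spec[0].startswith("/"):                 # /first/ /second/ form
        first, second = (s.strip("/") for s in spec)
        exclusive = second.startswith("^")      # ^ ends before the match
        second = second.lstrip("^")
        start = text.index(first)
        end = text.index(second, start + len(first))
        return text[start:end] if exclusive else text[start:end + len(second)]
    m, n = int(spec[0]), int(spec[1])
    if n >= 0:                                  # m n: n chars after m'th
        return text[m:m + n]
    return text[m:len(text) + n + 1]            # m -n: to n'th from end
```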

Since Lace '93 is a distributed document architecture not all the document's objects may be contained in the same file, or even on the same machine. In the extreme case (most likely in resource-based publishing) all the objects may be held externally with only the object relationships in the Lace '93 file.

5.4.2 Lace '93 Relationships

PDF files consist of a set of objects, a table of object offsets (used only to speed object access) and a reference to the root (top-level) object. The objects are undifferentiated from a storage point of view, but object relationships are described within each object and so each object must be interpreted not only to display it but also to be able to access the other objects on which it relies. This is a similar situation to embedding hypertext links inside a document with similar disadvantages.

PDF defines certain `implicit' object relationships, namely

a document    is composed of     a set of pages
a page        contains           a set of components
a page        requires           a set of font objects
a page        is described by    a thumbnail

Lace '93 will allow variant instantiations of a component based on physical rendering criteria (what resolution can the display provide?) or abstract rendering criteria (what are we trying to communicate to this student?). This would be enabled by allowing new relationships, such as
a text object        translates into French   another text object
an image             previews                 a hires photo
a PostScript object  renders                  an SGML object

Obviously these new kinds of relationships cannot be predefined by an International Standard, but must be allowed to be defined on a per-document-type basis. For this reason the object relationships must be labelled with an identifying `type' so that the displaying application knows how to treat the objects, and, starting at the root of the document, knows how to render the collection of objects as a whole. This kind of relationship facility is provided for in HyTime by the hyperlink, a construct which despite its name does not necessarily have anything to do with hypertext. A hyperlink simply associates a group of objects together, ascribing `roles' to each of the objects. A group of images could be tied together by a hyperlink with anchor roles "hires medres lowres caption".

Obviously there must be some agreement between the document type designer and the application designer so that the application can treat each object relationship appropriately. Some default rules for the treatment of unrecognised relationships would be necessary, such as `ignore the related objects' or `treat the relationship as equivalent to contains' or `treat the relationship as equivalent to contains-only-the-first-object'.

Let us adopt a notation for expressing the relationships between document objects: rel1(obj1, obj2, ..., objn) expresses a relationship, rel1, between the objects obj1 to objn. The following relationships may be useful for Acrobat:

previews(thumbnail1, page1)             a page's thumbnail preview object
contains(page1, text1, text2, photo1)   the objects on a page
required(font1, text1, text7, text9)    resources required for rendering

whereas the following extended relationships may be useful in Lace '93:

abstracts(text1, page3)           a text object summarises the information on a page
image(photo1, image2, bitmap3)    three different image formats of the same data
revision(text1, text2, text3)     three different versions of the same piece of text

The meanings of the first three relationships are built into Acrobat, whereas the latter three are not. Lace '93 makes it possible to use arbitrary relationships, but requires some method of defining the meaning of those relationships. It would be possible to produce an extended set of relationships and "hard-wire" them into Lace '93, but it is preferable to allow the relationships to be defined on a per-document-type basis. This is similar to SGML's approach, which delegates responsibility for interpreting the meaning of the document to the relevant application while making the content semantics as explicit as possible. The problem then is how to express the semantics of the relationships adequately in an open fashion.
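As a sketch (in Python, not part of any standard), the rel(obj1, ..., objn) notation and a per-document-type definition of relationship meanings might be represented as follows; the supertype vocabulary, the table contents and the default rule are illustrative assumptions:

```python
# Sketch: relationships as (name, objects) tuples, interpreted through a
# per-document-type table mapping each name to a built-in supertype.
# All names here are illustrative, not defined by Lace '93.

SUPERTYPES = {"contains", "alternative", "extra"}

DOC_TYPE = {                     # defined per document type, not hard-wired
    "contains":  "contains",
    "image":     "alternative",  # variant renderings of the same data
    "revision":  "alternative",
    "abstracts": "extra",        # compact rendition of another object
}

def supertype(rel_name, default="contains"):
    """Unrecognised relationships fall back to a default rule, here
    `treat the relationship as equivalent to contains'."""
    return DOC_TYPE.get(rel_name, default)

relationships = [
    ("contains",  ["page1", "text1", "text2", "photo1"]),
    ("abstracts", ["text1", "page3"]),
    ("image",     ["photo1", "image2", "bitmap3"]),
]
```
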

Microcosm's hypertext model allows the declaration and manipulation of arbitrary relationships between document items and so may provide an ideal engine for elaborating the relationships in Lace '93 documents. It has been demonstrated previously (section 4.3.2) that Microcosm's declarative link model caters for arbitrary link relationships, allowing either labelled relationships (such as the above revision, image or abstracts) or relationships referred to by an explicit specification of their semantics. However, in that model the relationships are specified mainly in terms of the object addresses. What is required by Lace '93 is not only a mechanism for expressing a fixed relationship between varied objects, but a mechanism for expressing flexible relationships between fixed objects. Here we are leaving the world of relationships between static document attributes and entering the world of relationships which depend on dynamic, runtime attributes of the system.

Microcosm services are usually invoked to follow a single link from one complete document to another; what is proposed here is that they are invoked en masse, in batch, from a document's hub, building up the necessary view of the document by resolving the links to the document components.

Let us take as an example some real but simple object relationships and examine their semantics.

* A document contains a set of pages. This containment relationship implies that the super-object consists of a sequential elaboration of a set of sub-objects. To render the whole document it would be necessary to construct each sub-object in order; in an interactive application, however, it is likely that the sub-objects would only be evaluated upon instruction from the user.

* A page contains a set of text and image objects. Here the sub-objects must all be evaluated (in any order) to produce the super-object.

* A text object requires a font. This relationship implies that another object (not a sub-object) must be elaborated beforehand in order to display the first object.

* An object is an alternative rendering of another object. This relationship implies that if a given object, required for display by a super-object, is unavailable or unsuitable for use, then an alternative object may be substituted in its place. This is a very general relationship and covers the case of alternative image resolutions, duplicate object copies available for network services and different renderings of a piece of textual information. The alternative object would normally be ignored, unless some special criterion was fulfilled (e.g. is this object's server down?).

* An object summarises another object. This relationship implies that a given object contains a compact rendition of the information in another object. It could be used to provide abstracts for documents and document parts, or captions for figures and tables. The summary object may very well be ignored unless asked for by the user.

* An object explains another object. This relationship is similar to the above, but gives a more, rather than less, detailed explanation.

Among these relationships we can see three principal types: objects containing other objects, objects being alternatives for other objects and objects providing extra information about other objects. These relationships are prototypes, or superclasses, of the actual relationships that are intended and may be represented using the HyTime device of architectural forms. Using this device, each relationship, expressed as an SGML tag, has a #FIXED attribute which labels its supertype (e.g. attribute lace93 may have value contains) as well as other attributes which are used to define further relationship semantics. When the document is interpreted, each relationship element would have its lace93 attribute inspected to determine how to treat it.
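The architectural-form lookup might be sketched as follows (in Python; the element representation, handler actions and the ignore-by-default rule are hypothetical, standing in for a real SGML parse):

```python
# Sketch of the architectural-form device: each relationship element's
# fixed `lace93' attribute names its supertype, which selects how the
# application treats it. Elements are modelled as (tag, attributes) pairs.

def treat(element, handlers):
    """Dispatch an element on its lace93 architectural-form attribute;
    unrecognised or missing forms are ignored (one possible default)."""
    tag, attrs = element
    form = attrs.get("lace93")            # e.g. "contains", "extra"
    handler = handlers.get(form)
    if handler is None:
        return None                       # unrecognised form: ignore
    return handler(element)

# Illustrative handler actions for the two supertypes discussed above.
handlers = {
    "contains": lambda e: "elaborate sub-objects",
    "extra":    lambda e: "offer as margin note or button",
}
```
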

In the current prototype, only the contains and extra supertypes have been implemented to allow rudimentary document composition. A number of slightly different containment relationships have been implemented, such as requires, where the contained objects need to be elaborated before the first object and includes, where the contained objects are elaborated after the original object. The extra objects may be elaborated in various ways, depending on the viewing application, document semantics and user's preference. Either the extra information may be incorporated into the document itself (perhaps as a marginal paragraph or with some visual highlighting to separate it from the contained text) or a reference to the material may be included in the form of a button or a menu choice.

The current prototype does not yet use Microcosm link services to implement the inter-object relationships.

5.4.3 Lace '93 Rendering

Acrobat Reader is an application that not only interprets PDF object relationships but also renders each of the document objects onto a set of pages. Since the PDF format is based on PostScript, Acrobat Reader uses an internal PostScript engine to decode the objects and produce the on-screen images. Early Lace '93 prototypes used pre-formatted PostScript objects as Acrobat does, but it was decided to use a higher-level physical document model to allow the use of non-PostScript objects without the added burden of formatting semantics. Although RTF was a strong contender because of its sophisticated document model, HTML was chosen since the native environment of Lace '93 would be the World-Wide Web. A disadvantage of HTML is its particularly limited document structure: it is not possible, for example, to represent marginalia or annotation elements.

Here are the steps which are followed in order to render a Lace '93 document (an example of such a document and the resulting HTML document is given in Appendix 2.3).

* Parse the document object specifications and the object relationships.

* Find objects which have a relationship of supertype contains with respect to the root object.

* Render each of these objects (by recursively looking for contained objects) into the Display Manager native format and also each of the objects that are extra to these objects (this step is not recursive if the objects are not to be contained in the document).

* Send the composed document to the viewing application (in this case Mosaic).

The dynamically composed document should contain hypertext facilities (e.g. buttons) to lead the reader back to the original objects from which it was composed.
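The rendering steps above can be sketched as a recursive traversal of the contains relationships from the root object (in Python; the relationship tables, object contents and anchor markup are illustrative stand-ins for a parsed Lace '93 file):

```python
# Sketch of Lace '93 rendering: elaborate the objects contained by the
# root (recursively), and turn each object's extras into hypertext
# buttons rather than recursing into them. All data here is illustrative.

def render(obj, contains, extras, content):
    """Return HTML for `obj', its extras (as links) and, recursively,
    the objects it contains."""
    html = [content.get(obj, "")]
    for x in extras.get(obj, []):             # extras are not recursive
        html.append('<a href="#%s">[extra: %s]</a>' % (x, x))
    for child in contains.get(obj, []):
        html.append(render(child, contains, extras, content))
    return "\n".join(part for part in html if part)

# Illustrative parsed document.
contains = {"root": ["page1"], "page1": ["text1", "text2"]}
extras = {"text1": ["note1"]}
content = {"text1": "<p>First paragraph.</p>", "text2": "<p>Second.</p>"}
```
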

5.5 Impact of the Model on Hypertext Production

The challenge at the start of this chapter was whether it is possible to expand authored coherence beyond the bounds of a locally coherent lexia, into a global hyperdocument. It was into that context that the Lace '93 object-centred document model was introduced. This environment is based on the model of a document as a view upon a set of objects, and is a fundamentally different document model from that in general use today.

If the general understanding of the nature of a document changes, then this affects the use and production of documents, and the way that information is disseminated. If the unit of information dissemination is no longer a complete magnum opus but a finer-grained object, then information can be shared more effectively, references may be more precise, and the goal of effective information re-use is more achievable. So one of the barriers to a more coherent view on a global hypertext network is the implementation of documents as files, because it hinders effective sharing and reuse of information resources. It also causes the problem of chunking, or splitting information into a set of nodes (see section 1.2.2), where each node (like a file) is well-defined with inviolable boundaries.

One of the problems inherent in the Web is that it promotes the old relationship document ≡ file. A single, complete file is viewed as an entity, information is authored and presented in terms of these entities, and links are made between information elements embedded in these entities. In order to express coherence it is necessary to make finer distinctions in the information to be presented. It is also necessary to present the information in a more immediate fashion than the traditional "click here to see this" paradigm which both discourages the reader [146] and increases their disorientation [27].

Thus the Lace '93 document architecture achieves the goal of providing global coherence by allowing hypertext authors to work with information components rather than information containers: units of information which can be reused and represented in new contexts and for different purposes. This is of particular significance for the users of a working environment which is becoming increasingly distributed (because of the Internet) and whose components are being shared to an unprecedented degree (because of the WWW).

6. CONCLUSIONS & FURTHER WORK

Coherence is the touchstone of the document world: it is what an author brings to a morass of information, and it is expressed through various structural devices. This thesis has presented a number of pieces of the author's work concerning structure and coherence in the development of hypertext documents, from local through to global scales.

Lace displays the isomorphism of text and hypertext, providing a mechanism for expressing local coherence through complex structured lexias.

Lace-92 provides a mechanism for creating locally coherent, complex, structured lexias from the contents of a global hyperbase.

Lace-93 implements true hyperdocuments: a way of applying a coherent, authored view to a global hyperbase of components.

The previous chapter of this thesis argues for the redefinition of the fundamental nature of a document as a view on a set of distributed objects. This is of particular significance since documents are increasingly being defined in terms of objects, and objects are being managed increasingly in a distributed context [120].

Other pieces of the author's work also support this conclusion: the work on the World-Wide Web in sections 4.1, 4.3.4 and 4.3.5 shows up the failings of current document technology in producing a coherent distributed document environment; the work with Microcosm shows the usefulness of generic link specifications in producing flexible structures for hypertexts (section 4.3) and defines a formal specification of Microcosm link semantics (section 4.3.2) that can be used as the basis of a link engine for the proposed distributed document architecture.

6.1 Structured Authoring

One of the themes of this thesis is the use of structure within hypertext environments of various kinds. Many recognise the advantages of, or even the requirement for, such structure. The SEPIA team concluded that there needs to be more structure in the authoring of hypertexts [138]. Thüring et al [139] propose design rules for hypertexts to maximise both local and global coherence of a work. Others agree, perhaps implicitly, that there should be an overall structure to constrain and direct the information content of a hypertext: Carlson [34] insists that every node in a network should be annotated sufficiently for the reader to conceptualise its place in a global context and Landow [86] proposes a novel arrival and departure paradigm, in which designers must provide the user with orientation information at each end of a link.

There is however a definite anti-structure debate for the `cutting edge' of hypertext use. Moulthrop [102] argues that attempts (such as the above) to coerce hypertexts to behave like printed texts wrongly constrain the medium when it should be acting firstly as an adjunct to print (allowing authors to experiment with a dynamic text) and then as an independent deconstructive literary medium. It also seems ironic that the mechanisms for expressing logical document structure which in Lace are used to completely specify the semantics of a document are used in Lace '93 to describe a less well-prescribed document semantics based on the dynamic determination of object relationships.

The use of structure as a document construction tool is one area which seems well worth following up. Lace '92 based its information retrieval tools on the then-developing WAIS service, before the WWW project gained its enormous popularity. By providing a single addressing scheme for many current information services (HTTP, FTP, USENET news, WAIS) the Web makes it possible to reference almost every document held online on the "information superhighway", but it offers little support for any task other than browsing. To extend the domain of Lace '92 to include the Web would therefore not only add functionality to a user's Web interface, but also increase the number of documents which are written for the Web and linked into its global literature.

Extending Microcosm to support WWW access (described in section 4.3.4) provides chaperoned access to the Web: documents are automatically imported by the user's document manager and are classified for future reference according to subject material, session time and current task. Making Lace '92 into a filter for Microcosm could define an authoring agent capable of keeping track of the various authoring tasks assigned to an individual (write a lecture on object-oriented databases, a paper on hypermedia standards, a literature review of CSCW) and of automatically recording documents as relevant sources when the user browses them.

6.2 World-Wide Web Analysis

Experience of browsing the WWW or random samples of the logs from the hyperfind program are not sufficient to make definite pronouncements about the state of the Web. Instead one needs to make quantifiable measurements of the hypertext, summarised into useful statistics (see [19] for a similar procedure carried out on a non-distributed hypertext). The logs produced by the hyperfind program do in fact extract various metrics from each document: size of the node in bytes, number of links from the node, percentage of the node's bytes which are markup and not content, and percentage of the node's bytes which are used to code links (anchor markup and anchor content). It is planned to use the retrieved data to make an automatic analysis of the pattern of Web usage: whether genuinely well-connected documents are being authored, or, as experience indicates, whether the Web consists of hundreds of separate hierarchies, with the roots of each hierarchy well-connected to the roots of others.

The hyperfind script has already been run on the URL of a known WWW catalogue (the WWW sites list maintained by the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign) and on the URLs of several major Web sites (CERN in Switzerland and the JNT in the UK). This exercise has so far produced a list of some 12,000 nodes at 600 sites.

Large nodes (more than a few kilobytes) are likely to be complete documents rather than small chunks of information. Single-lexia documents should be nodes with a high degree of internal structuring and should have a relatively high proportion of markup. The minimal markup required for a document (to frame its contents) comes to about 100 bytes, so a node with a low markup percentage may contain little more than this framing markup. The markup required for adding headings of different levels does not add a significant volume to the node either. It is the link markup which adds a significant amount, since the URLs (about 50 bytes long on average) are coded as markup attributes, not document content.

If the figures for markup size and link size are very close, then it is likely that most of the markup is being used to code links. This is frequently seen in catalog nodes which simply exist to point to other documents. It is also seen in documents generated as directory listings: they contain a title and one link for each file in the directory. In these cases the number of links will usually be quite high.

Nodes with a large proportion of links are probably catalog or hub nodes which exist only to point to other nodes and may be independent of the `content-bearing' network nodes. Nodes with a smaller proportion of links may just contain cross-references to nodes with related content.

From the node metrics it is possible to examine the connectivity of each node--how many nodes does it link to and (less conclusively) how many nodes link to it? Are the linked nodes within the same resource, within the same site or organisation, or world-wide?
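The node metrics discussed above might be gathered as follows (a rough Python illustration, not the original hyperfind program; the regular expressions are simplistic assumptions about HTML anchors and tags):

```python
# Sketch of per-node metrics: node size in bytes, number of links,
# percentage of bytes which are markup, and percentage of bytes used
# to code links (anchor markup plus anchor content).
import re

def node_metrics(html):
    size = len(html)
    tags = re.findall(r"<[^>]*>", html)                 # all markup tags
    markup = sum(len(t) for t in tags)
    anchors = re.findall(r"<a\s[^>]*>.*?</a>", html, re.I | re.S)
    link_bytes = sum(len(a) for a in anchors)
    return {
        "size": size,
        "links": len(anchors),
        "markup_pct": 100.0 * markup / size,
        "link_pct": 100.0 * link_bytes / size,
    }
```

A catalogue or hub node would show markup_pct and link_pct very close together, with a high link count; a content-bearing lexia would show a much larger gap between them.
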

The purpose of this work is to test the hypothesis that there is not a truly world-wide web, but a world-wide collection of local webs based on the hierarchical structures of the underlying services on which the Web is implemented.

6.3 Lace '93

The Lace '93 document model is a controlled view upon a set of objects. This requires the ability to keep track both of the objects themselves and of their meaning, as defined by their relationships to each other.

The author has postulated that the declarative Microcosm model is useful for elaborating the relationships between the objects, and that the relationship section of a Lace '93 document is in fact a private `linkbase' which acts upon the private `docuverse' which is the set of objects which may compose the document. This is the topic of further work as it is not yet clear how:

* links can be specified in terms of dynamic session properties rather than static document or object properties;

* Microcosm's flexible link semantics can be usefully combined with user-defined link relationships.

In order to keep track of the document objects it is necessary to employ some form of distributed object manager, different from common file managers. The object manager should keep track of objects and allow access to them by various means (id and property queries), but should also allow a flexible approach to objects (not a once-and-for-all partitioning of a file into fixed objects). There are a number of commercial and academic projects which provide object-centred services and which may be of use in this context. CORBA [111] concerns the way in which objects can be interfaced to one another by an Object Request Broker, but it does not clearly define a database for storing objects. It also places heavy emphasis on software objects which embody computational activities, whereas Lace '93 objects are dumb (computationally inactive) document components. PCTE, the Portable Common Tools Environment [143], does define an object base, but the individual objects can only be accessed by following links from other objects, instead of by querying their attributes. Perhaps the most likely candidate is a Persistent Object Manager (the kernel of every OODBMS) which is used to provide basic object storage and retrieval services [90].
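The object-manager services required here, retrieval by id and by property query, might be sketched as follows (in Python; the class, its interface and the example properties are hypothetical, not drawn from any of the systems mentioned above):

```python
# Minimal sketch of the object-manager services Lace '93 would need:
# access by object id and by property query, over dumb (computationally
# inactive) document components.

class ObjectStore:
    def __init__(self):
        self._objects = {}                 # object id -> properties dict

    def put(self, obj_id, **properties):
        """Register an object under an id with arbitrary properties."""
        self._objects[obj_id] = properties

    def get(self, obj_id):
        """Id query: the object's properties, or None if unknown."""
        return self._objects.get(obj_id)

    def query(self, **criteria):
        """Property query: ids of all objects matching every criterion."""
        return [oid for oid, props in self._objects.items()
                if all(props.get(k) == v for k, v in criteria.items())]
```
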

As well as the document model and format which have been described as making up Lace '93, it is necessary to consider an author's contribution to writing coherent documents. As with all SGML and HyTime uses, it is quite probable that the author will not directly manipulate the tagged document format, which will be hidden by the user interface. Work needs to be done on the constructs that may be included in the author's conceptual model, e.g.

transclusion The author's interface for specifying the containment relationships.

span links Links which are not so abbreviated as a short button, perhaps displayed in the margin, perhaps consisting of several paragraphs of text.

relationships The author's interface to using the relationships between the local and remote objects which make up the document.

As well as considering the author's interface to the document model, work needs to be done on the author's rhetoric: i.e. the kind of statements that are useful to make in this environment and the style of writing and intertextual references that are encouraged (see [110]). Part of this is shown in the experience of the Microcosm project constructing generic, reusable document resources, but some of it is specific to Lace '93--the possibility of using not only straight transclusions but (as the viewing application becomes more sophisticated) marginalia, annotations, commentaries, expansions and other rhetorical devices.

6.4 A Final Word

Although computers were originally intended to perform calculations for engineers and scientists, it was not long before they were being used to handle documents about their programs, and then documents for their own sake. As the economics of computer interaction changed, so they have become more and more used for information storage and retrieval, i.e. document-centred rather than calculation-centred.

In the early days of computing, a document was a sheaf of punched cards. Of course, if it was dropped it wasn't a document any more, yet despite this disadvantage the view of document as hardware has had a lasting influence on our view of how computers should implement documents. As punched cards became less common, documents lost their physical status and became files stored on magnetic media, but the legacy of the punched card was seen in the constraints put on these files: every line in the document was limited to a width of 80 characters and the document became a sequence of lines in a text editor.

Sequences of lines became sequences of paragraphs, ASCII text turned into variable-width, multi-font, scalable character glyphs, and sequential files are now turning into `structured storages' with complex internal organisation, but still a document is considered a bounded data object: something with a metaphorical elastic band strapped around it to keep its contents in. Against this environment hypertext has stood apart, offering exotic display services and connectivity features which cannot be reconciled with documents with impermeable boundaries.

This thesis reviews the way that hypertexts and documents have been constructed, and argues for a re-evaluation of the way we represent and compose computer-augmented documentation which unbinds information from monolithic storage units--documents and files are not synonymous. Both text and hypertext documents can be expressed by the relationships between their (potentially distributed) information components; such an explicit model will finally allow the distinction between text and hypertext to be abandoned.

BIBLIOGRAPHY

[1] ACM, Communications of the ACM, 31(7), ACM Press, 1988.

[2] Angerstein, P., `Summary of the Document Style Semantics and Specification Language (DSSSL), Draft International Standard 10179', International Standards Organisation Document ISO/IEC JTC1/SC18/WG8 N1427

[3] Apple Computer Inc., `OpenDoc: Shaping Tomorrow's Software', BYTE, February 1994

[4] Apple Computer Inc., `OpenDoc: Shaping Tomorrow's Software', White Paper. Available by anonymous FTP from cil.org at opendoc-interest/OD-overview.rtf (1993)

[5] Apple Computer Inc., Macintosh HyperCard User's Guide, Apple Computer Inc

[6] Bacon, R.A., `STOMP: Software Teaching of Modular Physics', Proceedings of the International Conference on Physics Computing, Lugano, 1994.

[7] Barron D., `Why use SGML?', Electronic Publishing: Origination, Dissemination & Design, 2(1), 3-24, (1989)

[8] Barron D., Rees M., `Text Processing and Typesetting with UNIX', Addison Wesley (1987)

[9] Bechtel B., `Inside Macintosh as Hypertext', in [120], 312-323

[10] Begeman, M., Conklin, J., `The right tool for the job', Byte, 13 (10), 255-267, (1988)

[11] Benest I, `A HyperText System with Controlled Hype', HyperText II Conference Paper

[12] Berners-Lee TJ, Cailliau R, Groff J-F, "The World-Wide Web", Computer Networks and ISDN Systems, 24(4-5), 454-459.

[13] Berners-Lee, T., `Hypertext Markup Language (HTML): A Representation of Textual Information and MetaInformation for Retrieval and Interchange', Internet Draft. Available by anonymous FTP from info.cern.ch at /pub/www/doc/html-spec.txt (1993).

[14] Berners-Lee, T., `Hypertext Transfer Protocol (HTTP): A Stateless Search, Retrieve and Manipulation Protocol', Internet Draft. Available by anonymous FTP from info.cern.ch at /pub/www/doc/http-spec.txt (1993).

[15] Berners-Lee, T., `Uniform Resource Locators (URL): A Unifying Syntax for the Expression of Names and addresses of Objects on the Network', Internet Draft. Available by anonymous FTP from info.cern.ch at /pub/www/doc/url-spec.txt (1993).

[16] Bookstein, A and Swanson, DR `Probabilistic models for automatic indexing', Journal of the American Society for Information Science, 25, 312-318, (1974)

[17] Bornstein J., Riley V., `Hypertext Interchange Format--Discussion and Format Specification', Proceedings of the Hypertext Standardization Workshop Jan 16-18 1990 , National Institute of Science and Technology (Special Publication 500-178), 39-47

[18] Botafogo R., Shneiderman B., `Identifying Aggregates in Hypertext Structures', Proceedings of the 4th ACM Conference on Hypertext 1992, 63-74

[19] Botafogo, R. A., Rivlin, E., Shneiderman, B., `Structural analysis of Hypertexts: Identifying Hierarchies and Useful Metrics', ACM Transactions on Office Information Systems, 10 (2), 142-180, April 1992.

[20] Bowman, C. M., Danzig P. B., Manber, U., Schwartz, M. F., `Scalable Internet Resource Discovery', Communications of the ACM, 37(8), 98-107, ACM Press, 1994.

[21] Brailsford D, Adobe's Acrobat--the Electronic Document Catalyst, Computer Science Technical Report, Nottingham University, UK

[22] Brown H., `Editing Structured Documents--Problems and Solutions', Electronic Publishing: Origination, Dissemination & Design, 5(4), 209-216 .

[23] Brown H., `Standards For Structured Documents', British Computer Society Journal, 32(6), 505-514, (December 1989)

[24] Brown P., `Hypertext: The Way Forward', Document Manipulation and Typesetting, 183-191, Cambridge University Press 1988

[25] Brown P., `Standards for Hypertext Source files: the experience of UNIX Guide', Proceedings of the Hypertext Standardization Workshop Jan 16-18 1990 , National Institute of Science and Technology (Special Publication 500-178), 49-58

[26] Brown P., `UNIX Guide: lessons from ten years' development', Proceedings of the 4th ACM Conference on Hypertext 1992, 63-70

[27] Brown, P. J., `Turning Ideas into Products: The Guide System', Hypertext `87 Papers, 33-40, (November 1987)

[28] Bryan M, Standards for Text and Hypermedia Processing, Information Services and Use, 13 (1993), 93-102, IOS Press.

[29] Bryan, M., `SGML: An Authors Guide to the Standard Generalized Markup Language', Addison Wesley Publishing Company, 1988.

[30] Burnard L., Rolling your own with the TEI, Information Services and Use, 13 (1993), 141-154, IOS Press.

[31] Bush, V. `As We May Think', Atlantic Monthly, 101-108, (July 1945)

[32] Campbell, B. and Goodman J. M. `HAM: A General Purpose HyperText Abstract Machine', Communications of the ACM, 31.7, 856-861, (July 1988)

[33] Caras, GJ, `Comparison of Document Abstracts as Sources of Index Terms for Derivative Indexing by Computer', Proceedings of the American Documentation Institute Annual Meeting, 4, 157-161, (1974)

[34] Carlson, P., `The rhetoric of hypertext', Hypermedia, 2, 109-31.

[35] Carr L, `HyperCard Extensions for Multi-Media Databases', Southampton University Department of Computer Science Technical Report, 88-1

[36] Carr L, Barron D, Hall W, Why Use HyTime?, Electronic Publishing: Origination, Dissemination and Design, 2(1), 3-24 (Dec 1993)

[37] Carr L, Davis H, Hall W, Experimenting with HyTime Architectural Forms for Hypertext Interchange, Information Services and Use, 13 (1993), IOS Press.

[38] Carr, L., Rahtz, S., Hall, W., `Experiments with TeX and hyperactivity', TeX90 Conference Proceedings, 13-20, Tugboat 12(1), TeX Users Group, PO Box 9506, Providence, Rhode Island, USA.

[39] Catlin K., Garrett N., Launhardt L., `Hypermedia Templates, An Author's Tool', Proceedings of the 3rd ACM Conference on Hypertext 1991, 147-160

[40] Cole F., Brown H., `Standards: What can Hypertext Learn from Paper Documents?', Proceedings of the Hypertext Standardization Workshop Jan 16-18 1990, National Institute of Science and Technology (Special Publication 500-178), 59-70

[41] Colson, F., Hall, W. Multimedia Teaching with Microcosm-HiDES: Viceroy Mountbatten and the Partition of India. History and Computing 3(2), 89-98, 1991.

[42] Conklin, E. J., `Hypertext: An Introduction and Survey', IEEE Computer, 17-41, (September 1987)

[43] Crane G., `Standards for a Hypermedia Database: Diachronic vs Synchronic Concerns', Proceedings of the Hypertext Standardization Workshop Jan 16-18 1990, National Institute of Science and Technology (Special Publication 500-178), 71-81

[44] Croft W., `A Retrieval Model for incorporating Hypertext Links', Proceedings of the ACM Conference on Hypertext 1989, 213-224

[45] Curtice, RM and Jones, PE `Distributional Constraints and the Automatic Selection of an Indexing Vocabulary', Proceedings of the American Documentation Institute Annual Meeting, 4, 152-156, (1967)

[46] Database Publishing Systems Ltd, `DynaText', Product Note, Database Publishing Systems Ltd, 608, Delta Business Park, Great Western Way, Swindon, Wiltshire, UK.

[47] Davis H., Hall W., Heath I., Hill G., Wilkins R., Towards an Integrated Information Environment with Open HyperMedia Systems, Proceeding of the ACM Conference on Hypertext, ACM Press 1992.

[48] Davis. H. C., `Version Control for the Hypermedia Systems', PhD Thesis, Department of Electronics and Computer Science, University of Southampton, Southampton, UK, 1994.

[49] De Bra P., Houben G., Kornatsky Y., `An Extensible Data Model for Hyperdocuments', Proceedings of the 4th ACM Conference on Hypertext 1992, 222-231

[50] DeRose, S. J., Durand, D. G., `Making Hypermedia Work: A User's Guide to HyTime', Kluwer Academic Publishers, 1994.

[51] Duncan, E. B., McAleese R. `Qualified citation indexing online?' In: National Online Meeting Proceedings--1982. Compiled by M E Williams and T Hogan. 77-85. Medford (NJ), Learned Information

[52] Duncan, E., `Structuring Knowledge Bases for Designers of Learning Materials', Hypermedia, 1 (1), Taylor Graham, 1989.

[53] Engelbart, D. C. and English, W. K. `A Research Center for Augmenting Human Intellect', AFIPS Conference Proceedings, 33.1

[54] Eysenck, M.W. & Keane, M.T. Cognitive Psychology: a Student's Handbook. Lawrence Erlbaum Associates, Hove, Sussex, 1990.

[55] Fountain A., Hall W., Heath I and Davis H, `MicroCosm: An Open Model for HyperMedia With Dynamic Linking', Southampton University Department of Computer Science Technical Report, 90-7

[56] Frei H., Stieger D., `Making Use of Hypertext Links when Retrieving Information', Proceedings of the 4th ACM Conference on Hypertext 1992, 102-111

[57] Frei H., Stieger D., `Making Use of Hypertext Links when Retrieving Information', Proceedings of the 4th ACM Conference on Hypertext 1992, 102-111

[58] Furuta R., `An Object-Based Taxonomy for Abstract Structure in Document Models', British Computer Society Journal, 32(6), 494-504, (December 1989)

[59] Furuta R., Plaisant C., Shneiderman B., `A Spectrum of Hypertext Constructions', Hypermedia 1(2), 179-195.

[60] Furuta R., Stotts P., `The Trellis Hypertext Reference Model', Proceedings of the Hypertext Standardization Workshop Jan 16-18 1990, National Institute of Science and Technology (Special Publication 500-178), 83-93

[61] Gosling, J., `The NeWS Book', Sun Microsystems.

[62] Halasz F., Schwartz M., `The Dexter Hypertext Reference Model', Proceedings of the Hypertext Standardization Workshop Jan 16-18 1990, National Institute of Science and Technology (Special Publication 500-178), 95-133

[63] Halasz, F, `Reflections on Notecards: 7 Issues for the Next Generation of HyperMedia Systems', Communications of the ACM, 31.7, 836-851, (July 1988)

[64] Halasz, F., Moran, T. P. and Trigg R. H. `Notecards in a Nutshell', Proceedings of the 1987 ACM Conference on Human Factors in Computing Systems, 45-52

[65] Hall, W., `Ending the Tyranny of the Button', IEEE Multimedia 1(1), 60-68, Spring 1994.

[66] Hall, W., Carr, L., Davis, H., DeRoure D., `The Microcosm Link Service and its Application to the World-Wide Web', Proceedings of the First International World-Wide Web Conference 1994, 25-34

[67] Harnden R., Stringer R., `Theseus', International Federation of Library Assistants, 18(3)

[68] Harnden R., Stringer R., `Theseus--A Model for Global Connectivity', Proceedings of UK Systems Society 3rd International Conference 1993, Plenum: New York.

[69] Harnden R., Stringer R., `Theseus--A Way of Doing', accepted for Hewson Report, HGC, Olney, Bucks.

[70] Harnden R., Stringer R., `Theseus--the Evolution of a HyperMedium', Cybernetics and Systems, Vol 24: 255-280

[71] Howell G, `Hypertext Meets Interactive Fiction', HyperText II Conference Paper

[72] Hutchings G., `Patterns of Interaction with a Hypermedia System: A Study of Authors and Users', PhD Thesis, Department of Electronics and Computer Science, University of Southampton, Southampton, UK, 1993.

[73] Hutchings G., Hall W., Colbourn C., `Patterns of Students' Interactions with a Hypermedia System', Interacting With Computers, 295-314, 5(3), Sept 1993, Butterworth-Heinemann

[74] Ichimura S., Matsushita Y., `Another Dimension to Hypermedia Access', Proceedings of the 5th ACM Conference on Hypertext 1993, 63-72

[75] International Standards Organisation, Hypermedia/Time-based Structuring Language (HyTime), ISO/IEC Standard 10744, 1992

[76] International Standards Organisation, Standard Generalized Markup Language (SGML), ISO Standard 8879, 1986

[77] Jonassen, D. H., `Semantic Network Elicitation: Tools for Structuring Hypertext', Hypertext: state of the art, 142-152, Intellect: Oxford, 1990

[78] Jonassen, D.H. Hypertext/Hypermedia. Educational Technology Publications Inc., Englewood Cliffs, NJ, 1989.

[79] Jordan D., Russell D., Jensen A.-M. & Rogers R., `Facilitating the Development of Representations in Hypertext with IDE', Proceedings of the ACM Conference on Hypertext 1989, 93-104

[80] Kaindl H., Snaprud M., `Hypertext and Structured Object Representation: A Unifying View', Proceedings of the 3rd ACM Conference on Hypertext 1991, 345-358

[81] Knopik, T., Ryser, S., `AI methods for structuring hypertext information', Hypertext: state of the art, 224-230, Intellect: Oxford, 1990

[82] Knuth, D. E., `The WEB system of structured documentation', Stanford Computer Science Report 980, Stanford, California, September 1983.

[83] Koegel JF et al, HyOctane: A HyTime Engine for an MMIS, Proceedings of Multimedia 93, ACM Press

[84] Koh, T., Loo, P. L., Chua, T., `On the design of a frame-based hypermedia system', Hypertext: State of the Art, 154-165, Intellect: Oxford, 1990

[85] Lamport L., `The LATEX book', Addison Wesley (1985)

[86] Landow, G. P., `The rhetoric of hypermedia: some rules for authors', Hypermedia and Literary Studies, MIT Press, Cambridge, 1991.

[87] Landow, G., `Writing With and Against A Hypertext System', Seminar, Department of Electronics & Computer Science, University of Southampton, UK, 1994.

[88] Lee. Z., `Computed Links for the Microcosm Hypermedia System', PhD Thesis, Department of Electronics and Computer Science, University of Southampton, Southampton, UK, 1993.

[89] Luhn, H. P., `The Automatic Creation of Literature Abstracts', IBM Journal of Research & Development, 2, 159-165, (1958)

[90] Manola, F., Heiler, S., Georgakopoulos, D., Hornick, M., Brodie, M., `Distributed Object Management', Technical Report, GTE Laboratories Inc., 1992.

[91] Marmann M., Schlageter G., `Towards a Better Support for Hypermedia Structuring: The HYDESIGN model', Proceedings of the 4th ACM Conference on Hypertext 1992, 232-241

[92] Marshall C., Halasz F., Rogers R., Janssen W., `Aquanet: A Hypertext tool to hold your knowledge in place', Proceedings of the 3rd ACM Conference on Hypertext 1991, 261-274

[93] Marshall C., Rogers R., `Two Years before the Mist: Experiences with Aquanet', Proceedings of the 4th ACM Conference on Hypertext 1992, 53-62

[94] Marshall C., Shipman F., `Searching for the Missing Link: Discovering Implicit Structure in Spatial Hypertext', Proceedings of the 5th ACM Conference on Hypertext 1993, 217-230

[95] Maurer, H., Tomek, I., `Some aspects of Hypermedia Systems and their treatment in Hyper-G', Wirtschaftsinformatik, 32(2), 187-196, April 1990.

[96] Mayes J., Kibby M., Watson H., `StrathTutor: The Development and Evaluation of a Learning-by-Browsing System on the Macintosh', Computers in Education, 12(1), 221-229, (1988)

[97] McBryan, O., `GENVL and WWWW: Tools for Taming the Web', Proceedings of the First International World-Wide Web Conference 1994, 79-90

[98] McCracken D., Akscyn R., `Experiences with the ZOG HCI System', International Journal of Man-Machine Studies, 21, 293-310, (1984)

[99] Michalak S., Coney M., `Hypertext and the Author/Reader Dialogue', Proceedings of the 5th ACM Conference on Hypertext 1993, 174-182

[100] Microsoft Corporation, `Object Linking and Embedding: Version 2.0 ', Microsoft Technical Backgrounder

[101] Microsoft Corporation, Reference to Microsoft Word, Microsoft Corporation

[102] Moulthrop S., `Beyond the Electronic Book: A Critique of Hypertext Rhetoric', Proceedings of the 3rd ACM Conference on Hypertext 1991, 291-298

[103] Moulthrop S., `Hypertext and the "Hyperreal"', Proceedings of the ACM Conference on Hypertext 1989, 259-263

[104] Moulthrop S., `Towards a Rhetoric of Informating Texts', Proceedings of the 4th ACM Conference on Hypertext 1992, 171-189

[105] Nanard J., Nanard M., `Using Structured Types to incorporate Knowledge in Hypertexts', Proceedings of the 3rd ACM Conference on Hypertext 1991, 329-343

[106] Nanard J., Nanard M., `Should Anchors be Typed too?', Proceedings of the 5th ACM Conference on Hypertext 1993, 51-62

[107] Nelson, P., `User Profiling for Normal Text Retrieval', Proceedings of the American Documentation Institute Annual Meeting, 4, 228-295, (1974)

[108] Nelson, T. `Computer Lib', 2nd Edition, Microsoft Press, 1987

[109] Nelson, T. `Literary Machines', published by the author, ISBN 0-89347-056-2

[110] O'Neill J., `Intertextual Reference in Nineteenth Century Mathematics', Science in Context, 6(2), 435-468, (1993)

[111] Object Management Group, `The Common Object Request Broker: Architecture and Specification', Document Number 91.12.1

[112] Parunak H., `Don't Link Me In: Set Based Hypermedia for Taxonomic Reasoning', Proceedings of the 3rd ACM Conference on Hypertext 1991 ,233-242

[113] Parunak H., `Hypercubes Grow on Hypertrees (and other observations from the land of hypersets)', Proceedings of the 5th ACM Conference on Hypertext 1993, 73-81

[114] Chen P. et al, `The VorTeX Document Preparation Environment', Lecture Notes in Computer Science 236, 45-54, (1986)

[115] Price R, MHEG: An Introduction to the future International Standard for Hypermedia Object Interchange, Proceedings of Multimedia 93, ACM Press

[116] Quint V., Vatton I., `Combining Hypertext and Structured Documents in Grif', Proceedings of the 4th ACM Conference on Hypertext 1992, 23-32

[117] Rada, R., `Hypertext: From Text to Expertext', McGraw-Hill Book Company, London. 1991.

[118] Rahtz, S. P. Q., Carr, L. A., Hall, W. H., `Creating multimedia documents: hypertext processing', Hypertext: state of the art, 183-193, Intellect: Oxford, 1990

[119] Raymond D., Tompa F., Hypertext and the Oxford English Dictionary, Communications of the ACM, 31(7), 67-83 (1988).

[120] Reinhardt, A., `Managing the New Document', 91-104, Byte, August 1994

[121] Riley V., `An Interchange Format for Hypertext systems: the Intermedia Model', Proceedings of the Hypertext Standardization Workshop Jan 16-18 1990, National Institute of Science and Technology (Special Publication 500-178), 213-222

[122] Ritchie I., `Hypertext--Moving Towards Large Volumes', British Computer Society Journal, 32(6), 516-523, (December 1989)

[123] Rizk A., Sauter L., `Multicard: An Open Hypermedia System', Proceedings of the 4th ACM Conference on Hypertext 1992, 4-10

[124] Rizk A., Streitz N., André J. (Eds.) `Hypertext: Concepts, Systems and Applications', Proceedings of the European Conference on Hypertext, INRIA, France, (1990), Cambridge University Press

[125] Rubinoff, M and Stone, DC `Semantic Tools in Information Retrieval', Proceedings of the American Documentation Institute Annual Meeting, 4, 169-174, (1974)

[126] Rubinstein, R., `Digital Typography: An Introduction to Type and Composition for Computer System Design', Addison Wesley, 1988

[127] Salton G., `Selective Text Utilization and Text Traversal', Proceedings of the 5th ACM Conference on Hypertext 1993, 131-144

[128] Shneiderman B, `Designing the User Interface', Addison-Wesley 1987

[129] Shneiderman B. & Kearsley G., `Hypertext Hands-On!', Addison-Wesley 1989

[130] Shackelford D., Smith J., Smith F., `The Architecture and Implementation of a Distributed Hypermedia Storage System', Proceedings of the 5th ACM Conference on Hypertext 1993, 1-13

[131] Shackelford, D. E., `The Architecture and Implementation of a Distributed Hypermedia Storage System', Proceedings of the 5th ACM Conference on Hypertext 1993, 1-13

[132] Shavelson, R., `Methods for examining representations of subject matter structure in students' memory', Journal of Research in Science Teaching, 11, 231-249, 1974.

[133] Silverman, C & Halbert, M `Relevancy Revisited--the User as Learner', Proceedings of the American Documentation Institute Annual Meeting, 4, 53-57, (1974)

[134] Newcomb, S., Kipp, N., Newcomb, V., `The "HyTime" Hypermedia/Time-based Document Structuring Language', Communications of the ACM, 34(11), 67-83, (November 1991).

[135] Stotts P., Furuta R., `Hypertext 2000: Databases or Documents?', Electronic Publishing: Origination, Dissemination & Design, 4(2), 119-121, (1991)

[136] Stotts P., Furuta R., Ruiz J., `Hyperdocuments as Automata: Trace-based Browsing Property Verification', Proceedings of the 4th ACM Conference on Hypertext 1992, 272-281

[137] Streitz N., Haake J., Hannemann J., Lemke A., Schuler W., Schütt H., Thüring M., `Sepia: A Co-operative Hypermedia Authoring Environment', Proceedings of the 4th ACM Conference on Hypertext 1992, 11-22

[138] Streitz N., Hannemann J., Thüring M., `From Ideas and Arguments to Hyperdocuments', Proceedings of the ACM Conference on Hypertext 1989, 343-364

[139] Thüring M., Haake J., Hannemann J., `What's Eliza doing in the Chinese Room? Incoherent Hypertexts and how to avoid them', Proceedings of the 3rd ACM Conference on Hypertext 1991, 161-177

[140] Tyler S., `The Said & The Unsaid: Mind, Meaning and Culture', Academic Press (1978)

[141] van Dijk, T.A., `Macrostructures: An Interdisciplinary Study of Global Structures in Discourse, Interaction and Cognition', Hillsdale N.J: L. Erlbaum, 1980

[142] van Dijk, T.A., `Text and Context', Longman's Linguistic Library, London: Longman, 1977.

[143] Wakeman, L., Jowett, J., `PCTE: The Standard for Open Repositories', Prentice Hall, 1993

[144] Wright P., `Cognitive overheads and prostheses: some issues in evaluating hypertexts', Proceedings of the 3rd ACM Conference on Hypertext 1991, 1-12

[145] Wright P., Lickorish A., `An empirical comparison of two navigation systems for two hypertexts', HyperText II Conference Paper

[146] Wright, P. Cognitive Overheads and Prostheses: Some Issues in Evaluating Hypertexts. In Hypertext `91: Proceedings of Third ACM Conference on Hypertext, San Antonio, TX December 15-18 1-12, 1991.

[147] Yankelovich N. , Van Dam A., Meyrowitz N., `Reading and Writing the Electronic Book', IEEE Computer, 15-30, (October 1985)

[148] Yankelovich, N. et al, `The Concept and Construction of a Seamless Information Environment', IEEE Computer, 81-96, (Jan 1988)

[149] Zunde, P. `Evaluating and Improving Internal Indexes', Proceedings of the American Documentation Institute Annual Meeting, 4, 86-89, (1974)

APPENDIX 1--A HYPERTEXT HISTORY

A1.1 Memex

The Memex [31] was conceived as a personal library station which held all articles and journals on microfilm. A photographic platen allowed the entry of new pages including texts, longhand notes and photographs. As such, it performed a similar function to the modern video-WORM, a videodisc device where each frame can be written to once only. Each frame was entered into the standard library indexing scheme, by which means it would be located by the reader. Having selected a frame, the reader was able to `tie' it to another relevant frame by operating various controls. This would record each frame's number on a vacant space on the other frame along with a name which the reader would give. Both frame numbers and the name would be entered into the user's private `code book' so that the associative link was available to be browsed in its own right.

Thereafter, calling up such a frame would display a list of link names which the reader may choose to follow by manipulating an appropriate set of control levers. Chaining these `links' together provided `trails' of interest. Bush anticipated electronic encyclopaedias produced with ready-made meshes of these associative trails.

What Bush conceived was a hypertext system with bi-directional named links which mapped frames to frames. The nodes were ordered in a mainly hierarchical fashion (following standard library classification and indexing procedures) with arbitrary cross-referencing; however, this ordering was not to be an inherent feature of the system, but a discipline imposed on the initial set of links.

With the exception of links, the information stored was essentially analog in nature, and so no mechanically assisted browsing was possible. The main advantages that the memex offered a library user were therefore a much-increased speed of access and the explicit storage and browsing of trails of thought.

A1.2 NLS/Augment

NLS (the oNLine System) [53] was developed as a complete work environment where all project intercommunication could be done via the computer console.

All of the project information was stored in files, with each file divided into hierarchical statements. Arbitrary reference links were allowed between statements; links were commonly displayed as tagged code-strings inside the text. The console was divided into multiple `windows', each of which provided a view onto some part of the data. A link was activated by clicking with the mouse on a link tag and then again on the window where the result was to be displayed. Links could be indirect, in which case the tag referred to a statement where the final link address was to be found.

NLS in its original form had no buttons; jumps were performed by activating a link specification (either by clicking on it with the mouse button or by typing its name). Each link specification was composed of three parts: the display start, which consisted of an address to jump to and a modification of that address; the view filter; and a format specification.

The display start was the name of a statement (the first word of the text of that statement), the name of a marker which was pointing to the statement or the statement's id. The id is a statement's address within the file's hierarchy; for example, `6b5' refers to the fifth subsubstatement of the second substatement of statement 6. The address modification was an operation such as `successor', `predecessor', `parent' or `eldest child' that yielded a new statement according to the text's structure. The `search' operation allowed a statement to be selected by the text that it contained, with rules given in a content analysis language. The view filter was used to select which statements following the display start appeared to the user. Filtering took place according to the statements' depth in the hierarchy (level filtering) and their content (using the same rules as above). The format was used to restrict the length of the statements that were displayed and to control the space that separated them.
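The flavour of this addressing scheme can be sketched in a few lines of Python. The parsing rules and function names below are illustrative assumptions, not NLS's actual syntax definition: an id such as `6b5' is treated as alternating runs of digits and letters, each naming a 1-based position at successive depths of the statement tree.

```python
import re

def parse_id(sid):
    """Split an NLS-style id such as '6b5' into a path of 1-based indices.
    Digit runs and letter runs alternate as the hierarchy deepens
    ('b' -> second substatement)."""
    path = []
    for part in re.findall(r"\d+|[a-z]+", sid):
        if part.isdigit():
            path.append(int(part))
        else:
            # treat a letter run as a base-26 index: 'a'=1, 'b'=2, ...
            n = 0
            for ch in part:
                n = n * 26 + (ord(ch) - ord("a") + 1)
            path.append(n)
    return path

# Address modifications become simple path arithmetic on the parsed id.
def parent(path):
    return path[:-1]

def successor(path):
    return path[:-1] + [path[-1] + 1]
```

Structural operations then fall out directly: the parent of `6b5' is `6b' (path [6, 2]) and its successor is `6b6' (path [6, 2, 6]).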

The level-filtering operation is commonly used to provide a summary of a document on the assumption that the more detailed elaborations of an argument appear in the lower reaches of the document's structure. This leads to a distinctively artificial style of writing (an example of which is seen in [53] which was authored with the NLS system).

A1.3 ZOG

ZOG [98] was a network-based, multi-user hypertext system developed at Carnegie-Mellon University and was later expanded into a commercial system called KMS.

ZOG consisted of a database of screen-sized text frames which were viewed one at a time on standard computer terminals. Each frame had a structured layout consisting of a title, topic information, menu selections and global pads. The `topic information' was the text that held the frame's knowledge; the title was a one-line summary of this information and gave the frame its unique identification. The menu area gave a set of alternative destinations for finding further information. By convention, the labels on the menu choices were the same as the names of the frames to which they led. The global pads sat at the bottom of the frame and provided a standard set of choices for the reader (for example, go back, go forward and help). All choices were made by selecting the labels with a mouse or by typing the number of the menu item (or the initial letter of the pad). Although the database could represent any arbitrary network topology, the ZOG designers expressed a strong preference for tree structures as the initial format of the data. Frames were accordingly designed so that each menu item led to the children of the current frame and the global pads `next' and `previous' moved along the current frame's siblings. The ZOG philosophy stressed that menu items must only perform tree-wise navigation and that cross-reference jumps could only be performed by the pads.

Browsing a ZOG database was accomplished purely by selecting menu items and pads: there was no facility to find a frame by satisfying a particular query, but the speed of response which ZOG achieved (a fraction of a second between selecting an item and its being displayed) allowed users to locate information and select potentially interesting branches very quickly. Each selection was associated with an action. The default action was to go to the referenced frame, but a simple internal programming language allowed more complicated interaction with the user.
More sophisticated requirements were fulfilled by `agents', which were external programs invoked through the host computer's operating system, sharing a common convention for taking their input from, and placing their output in, predefined ZOG frames.
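A ZOG-style frame database can be sketched as a simple data structure; the frames, labels and contents below are invented for illustration, and follow the ZOG convention that menu labels match the titles of the frames they lead to:

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    title: str                                  # one-line summary; unique id
    topic: str = ""                             # the frame's body text
    menu: dict = field(default_factory=dict)    # label -> child frame title
    pads: dict = field(default_factory=dict)    # e.g. 'back', 'next', 'help'

# A tiny hypothetical database, tree-structured as ZOG's designers preferred.
db = {
    "Animals": Frame("Animals", "Kinds of animal.",
                     menu={"Birds": "Birds", "Fish": "Fish"}),
    "Birds":   Frame("Birds", "Feathered animals.",
                     pads={"back": "Animals", "next": "Fish"}),
    "Fish":    Frame("Fish", "Aquatic animals.",
                     pads={"back": "Animals"}),
}

def select(frame, label):
    """Follow a menu item (tree-wise navigation) or a global pad
    (cross-reference jumps and sibling moves)."""
    dest = frame.menu.get(label) or frame.pads.get(label)
    return db[dest]

current = select(db["Animals"], "Birds")   # menu: down the tree
current = select(current, "next")          # pad: along the siblings
```

The separation of `menu' from `pads' mirrors the ZOG rule that menu items descend the tree while only pads cross-reference.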

A1.4 Intermedia

Intermedia [148] was developed at Brown University in the mid-1980s and built upon three systems that had been developed over the previous twenty years. Though it is now one of the most famous of all hypertext systems, it remained an academic showcase rather than a generally available or commercial tool. It was built on the Macintosh computer and adheres to the Macintosh philosophy of a central, unifying user interface. Intermedia provides a set of applications, each of which directly manipulates a different medium (text, 2D and 3D graphics, scanned images and historical reference charts called `timelines') in a WYSIWYG fashion, but all of which share the same metaphor and commands to accomplish similar abstract operations on each different medium.

The user interface (now very familiar) consisted of a bitmapped display on which many overlapping windows were drawn. A mouse was used to manipulate the various objects portrayed in each of the windows, and a menubar of available commands was displayed at the top of the screen (commands were invoked by choosing them with the mouse or by control-keys on the keyboard). The applications were similar in function to the commercially available MacWrite, MacDraw and MacPaint, but added the capability of creating, editing and following links.

Creating a link was (intentionally) very similar to executing a Cut/Paste operation from the Macintosh desktop metaphor. A source item is selected and the ``Start Link'' command is chosen. The destination item for the link is then selected and the ``Complete Link'' command is chosen. Small icons are displayed at the source and destination to indicate the presence of the link. Double-clicking on such a link anchor point will bring up a new window containing the destination point in the same way that double-clicking on a program icon will start that program. This `seamless' grafting of added functionality onto a pre-existing user interface is emphasised throughout Intermedia.

The link end-points may be attached to any block (a contiguous selection of the document), and not just to the document `node' or window which contains them. This is an important distinction, as many systems distinguish between the container (frame, window or card) and the information in it by allowing links to be addressed to a container. Subsequent editing or restructuring of the document may often lead to many of the links becoming invalidated because they no longer point to the correct information. Intermedia does not suffer from this drawback because links are attached to the information itself.

When working with a particular document, a map window is displayed which shows icons representing the current document and the links that exist to other documents. This map is updated as the current document changes and as new links are added.

Both links and blocks may have property sheets associated with them (analogous to the style sheets that control the physical appearance of a paragraph in a word processor). The property sheets contain fields showing the creator id, the time of creation (automatically filled in), an explainer and a list of keywords (supplied by the author) and may be used as part of the query specification for a search.

Intermedia keeps block and link information separate from the documents themselves, storing it in webs instead. When a browser opens a web it imposes that web's set of links on a family of documents, allowing different users to maintain different perspectives on a set of literature.
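This separation of links from documents can be sketched as follows. The file names, phrases and function names are hypothetical, and block addressing is reduced to character spans for illustration; the point is that link data lives in a web, not in the documents, and anchors attach to spans rather than to windows:

```python
# The documents themselves carry no link markup at all.
documents = {
    "essay.txt": "Mountbatten arrived in India in March 1947 ...",
    "notes.txt": "Partition timeline and sources ...",
}

def block(doc, phrase):
    """A block is a contiguous selection, addressed by its span in the text."""
    start = documents[doc].index(phrase)
    return (doc, start, start + len(phrase))

# Two webs impose two different link sets on the same family of documents.
web_history = [
    (block("essay.txt", "March 1947"), block("notes.txt", "Partition timeline")),
]
web_sources = []   # another reader's perspective: no links here yet

def links_from(web, doc, offset):
    """Find the destinations of links whose source block covers an offset."""
    return [dst for (d, s, e), dst in web if d == doc and s <= offset < e]
```

Because a block is located by its span in the text rather than by the window that displays it, a different web can be opened over the same unmodified files.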

A1.5 Notecards

Notecards [63, 64] was developed at Xerox PARC in the mid-1980s and is implemented on Xerox workstations in an extensible Lisp programming environment which supports a high-resolution graphics screen displaying many windows and user interaction via a mouse and keyboard. The notecards of the system are onscreen windows which represent conventional 3" x 5" library cards, but which may be of arbitrary dimensions and may contain many different kinds of information (text, graphics, video or animation).

Notecards supports unidirectional typed links which connect information at the card level. The source of a link is displayed as an icon anchored at a point on the source card. Clicking that icon will display the notecard which is its destination. Each icon has a different appearance according to the nature of the information that is contained on the destination card.

There are two specialised card types: browsers and fileboxes. A filebox can `contain' other cards and fileboxes and is used to impose an initial hierarchical structure on a network of cards (the system imposes the restriction that each notecard must be stored in at least one filebox). A browser is a card which contains a diagram of a network of notecards. The diagram is created by the system and can be used for navigation, or edited directly to change the structure of the network by relinking notecards.

Navigation is mainly achieved by following links in one of three different contexts: a browser, a filebox or a notecard. Apart from these three mechanisms, there is a simple query system which searches for nodes matching the reader's specification.

A1.6 HAM

HAM (Hypertext Abstract Machine) is based on the hypertext engine of the Tektronix Neptune system [32]. It is a `back end' system which allows different hypertext user interfaces to be grafted onto it. Somewhat analogous to Sun's NFS, it is a transaction-based server communicating via a byte-stream protocol in a network environment, and sits on top of the host computer's file system.

HAM defines various objects which make up a hypertext network, together with the operations that can be performed on those objects. HAM's top-level object is the graph, which represents a complete hypertext network and which is partitioned into a tree of contexts. A graph is composed of nodes joined by links. A node may contain text or binary data and may be subject to automatic version control. Links are used to relate a source and a destination node and may also be subject to version control. The versioning allows the state of any node or link to be queried at any point in its history. Contexts, nodes and links may all have attributes attached which can provide application-specific information.

HAM defines the following generic operations which can be applied to any object: create, destroy, get and change. All manipulate an object according to a particular version time. For example, to read a node the `get' operator is passed a reference to a node and a version time and returns the data that the object held at that time. Aside from miscellaneous operations that are only applicable to particular object types, there is also a filter operation which takes a version time and a predicate (a test based on attribute values) and returns a list of all the objects in a graph which satisfied that predicate at that time.
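A minimal sketch of this style of versioned object, assuming a simple integer version time; the class and function names are invented for illustration and are not HAM's actual interface:

```python
import bisect

class VersionedNode:
    """Every change is kept, so the node's state (data and attributes)
    can be queried at any point in its history."""
    def __init__(self):
        self.history = []   # sorted list of (time, data, attributes)

    def change(self, time, data, **attributes):
        self.history.append((time, data, attributes))
        self.history.sort(key=lambda v: v[0])

    def _at(self, time):
        """Index of the latest change at or before `time`, or -1."""
        times = [t for t, _, _ in self.history]
        return bisect.bisect_right(times, time) - 1

    def get(self, time):
        i = self._at(time)
        if i < 0:
            raise KeyError("node did not exist at that time")
        return self.history[i][1]

def filter_graph(nodes, time, predicate):
    """Return the names of nodes whose attributes satisfied the
    predicate at the given version time."""
    out = []
    for name, node in nodes.items():
        i = node._at(time)
        if i >= 0 and predicate(node.history[i][2]):
            out.append(name)
    return out
```

For example, a node changed to `draft' at time 1 and `final' at time 5 still reads as `draft' when queried with any version time between 1 and 4.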

HAM has been used to emulate various hypertext systems, but as it is not itself a hypertext system, those entries in the following table which depend on the capabilities of the front end have been marked as not applicable.

A1.7 HyperTIES

HyperTIES was developed at the University of Maryland and designed as an easy-to-learn tool for browsing instructional databases. Its appearance is similar to that of ZOG with nodes being displayed on terminal-like computer screens. Unlike ZOG, each node may be composed of several screens of information. HyperTIES links are in the form of embedded menus which are simply highlighted words or phrases which form a natural part of the text. There is always one link which is selected, but the `focus of attention' can be shifted by using the terminal's arrow keys. Activating the selected link causes the destination node to replace the current node on the screen. The system keeps track of the path that the user has taken allowing steps to be easily retraced.

HyperTIES' unusual method of link selection has been found to be particularly efficient, especially for novice users [128]. The embedded menu technique has also been shown to be effective in an environment where those who are using it are not computer literate, and stands opposed to the practice of creating separate icons to act as link anchors or having a separate set of menu options. HyperTIES also includes an authoring package to allow people with limited computer skills to create and maintain a HyperTIES database. An introduction to hypertext intended for such users [129] has been published in book form in conjunction with a PC disk containing the same information in a HyperTIES database.

A1.8 HyperCard

HyperCard is probably the most widely used hypertext system, as it was bundled with the Macintosh Operating System from 1988 (see [5] for details) until 1992. Unlike other systems which impose an initial hierarchical structure on the network of nodes that they manage, HyperCard manipulates a linear sequence (stack) of nodes (cards), probably because it is aimed largely at novices. Each card contains three types of objects: bitmap pictures, rectangular fields of text and buttons which are sensitive to mouse events. Typically pictures are used both to give a visual design to the stack (for example, to impersonate a spiral-bound notebook) and to display graphical information (diagrams, maps and scanned images). Fields are used to hold the main body of the stack's textual information, while buttons are either placed over significant phrases in a field to implement hypertext jumps to another card (providing HyperTIES-style embedded menus) or placed in some standard location around the periphery of the screen to provide the functionality of ZOG's global pads. A link is made by creating a new button, choosing its ``Link'' option, navigating to the destination card and selecting the ``Link Done'' option. Content-based browsing is also possible through the use of the ``Find'' command.

Behind the scenes HyperCard implements a full programming language called ``HyperTalk''. This object-oriented language is used to write scripts for the various objects (stacks, cards, fields and buttons), allowing them to respond to various events (e.g. opening a stack, going to a new card, clicking the mouse or pressing a key). Apart from defining new functions with HyperTalk, new routines written in C or Pascal can be linked into the system. This has allowed new media (video, sound and animation) to be incorporated into HyperCard stacks (see [35]).
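The essence of this event handling is message passing up an object hierarchy: an event is offered to the object it occurs in first, then passed along (button, then card, then stack) until some script handles it. The sketch below models only that dispatch idea; the object names and handlers are invented, and HyperTalk's real hierarchy has further levels (such as backgrounds) that are omitted here:

```python
class HyperObject:
    """An object in the hierarchy, with optional scripts (handlers)."""
    def __init__(self, name, parent=None, handlers=None):
        self.name = name
        self.parent = parent
        self.handlers = handlers or {}   # message name -> script

    def send(self, message):
        obj = self
        while obj is not None:
            if message in obj.handlers:
                return obj.handlers[message](obj)
            obj = obj.parent             # unhandled: pass up the hierarchy
        return None                      # no script handled the message

# A hypothetical stack containing one card with one button.
stack  = HyperObject("stack", handlers={"openStack": lambda o: "welcome"})
card   = HyperObject("card1", parent=stack,
                     handlers={"openCard": lambda o: "showing " + o.name})
button = HyperObject("go", parent=card)
```

A message sent to the button that neither the button nor the card handles, such as `openStack', is caught by the stack's script; a wholly unhandled message simply falls off the top of the hierarchy.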

One of the problems of describing the capabilities of HyperCard (or indeed NoteCards) as a hypertext system is that it is coupled with a highly-extensible programming language and a sufficiently rich set of data types to allow the basic system to emulate more or less any other type of hypertext system. By carefully writing a set of scripts it is possible to make HyperCard act like Guide [27] or NoteCards. Such emulations may suffer from efficiency problems and may not provide the level of service that a hand-tuned system would, but even so it is difficult to draw the line between what HyperCard can and cannot do.

Despite the power that HyperTalk provides (allowing computationally-active hypertexts), HyperCard remains difficult to use for pure hypertext, since buttons are anchored to a physical point on a card, not to a region of text within a field. For this reason, even the most minor editing operation on a field will require that all its link anchors be repositioned. Cumbersome work-arounds are possible, but it is for this reason that many claim that ``HyperCard is not hypertext''. Instead, HyperCard's ease of use, its integration with the Macintosh environment and the power of its programming language have led to its success both as a software prototyping tool and an easy-to-learn front-end to highly technical software such as the Oracle database. It is also being used as the basis for help systems for other software.

A1.9 Guide

Guide [27] was developed at the University of Kent as a naïve user's tool for reading documents on a computer and has since been turned into a commercially available product for the Macintosh and IBM PC. It does not partition documents into nodes, but presents them as a single scroll to the user. Its main browsing mechanism is the replacement button (another embedded menu item) which is replaced inline by the material to which it is linked. There is no jumping to a new location, and therefore none of the `lost in hyperspace' problems that bedevil the users of other hypertext systems. An expansion is activated simply by clicking on it with the mouse-pointer and subsequently closed up by clicking a second time. Each of these actions has feed-forward built-in: the cursor changes to distinctive shapes when moved over a button which can be expanded and over an expansion which can be closed. In addition the complete extent of the expansion is visually highlighted when the mouse button is pressed, allowing the reader to change his mind by moving the mouse out of the expansion before releasing the button.

The author enters text as with a normal word processor (the Macintosh implementation provides the usual font-, size- and style-changing commands), then selects a piece of text and, through a menu operation, turns it into a replacement button. The button is then replaced with the default expansion ``..expansion..'' which the author edits. A symmetrical operation allows the author to select an expansion and to provide a name for the replacement button.

The newly-created replacement button-expansion pair is now difficult to edit because its text cannot be selected by the mouse. To remedy this, a menu command freezes the state of all buttons, allowing their text to be changed.

A typical Guide document is encountered in a `high-level summary' form, where each summary is in fact a replacement button. Clicking on each summary will expand it to show more detailed information, with each expansion containing more replacement buttons. In this respect Guide acts as a folding editor, progressively disclosing more and more information at the reader's request. Slight variations of the replacement buttons are used to display relevant information in other windows or to jump to another document, but these are rarely used, according to the author. As Guide is intended for the novice user it discourages the use of both disorientating hypertext `gotos' and `find' operations, preferring instead to use inline expansions.

Guide has been used in commercial environments, and has been used as an online help tool for the Macintosh PageMaker program.

A1.10 Aquanet

Aquanet [92, 93] is a hypertext system designed to support knowledge structuring tasks; its design grew out of experience of using NoteCards to represent structures of argumentation. It incorporates a richer linking model in order to express complex relations, combining hypertext and frame-based representations.

Both nodes and links are examples of objects composed of a set of unordered, named, typed slots containing values which are basic data types (numbers, text, pictures). Link objects have slots whose values are allowed to be other objects and each object type has a different graphical appearance for manipulation on a graphical browser.

The user interface is given by windows which contain graphical views onto the full structure of the hypertext network, a list of all the objects in the network and a view of the full contents of a selected object. The kinds of objects used in the hierarchy and the relationships between them are constrained by a set of schemata.

Aquanet's link objects are in fact n-ary relationships, and their graphical representation reflects their slot-based nature. In order to make links between a set of objects in Aquanet, the slots in the link object are filled in with the names of the linked objects. The literature does not make it clear how this is handled by the user interface; what is clear is that it is the opposite of most linking operations, since instead of selecting a node and applying a link to it, one selects a link and applies a set of nodes to it. A corollary is that there are no link anchors or buttons which can appear within a node; instead, when viewing the hypertext it is the nodes which appear inside the link relations. (Links are therefore node-to-node relationships.)

A1.11 Microsoft Word for Windows

Microsoft's word-processor, Word for Windows, contains many of the features required to make a hypertext system. It has the ability to address documents (via file names) or document components (via symbolic bookmarks or character, line and page offsets). It also has buttons which can cause a hypertext `jump' to a named destination or (by virtue of a built-in programming language) can make an arbitrary computation occur. Links are not first-class objects, but nodes may be truly multi-media by virtue of the underlying operating system facilities. Since it is a word-processor, Word has some of the most comprehensive facilities for nodes, which may be of arbitrary size and contain a mixture of formatted text, diagrams, pictures and charts. Nodes may not be named (except that whole documents have a file name), nor are there any facilities for attaching attributes.

Word is not meant to be a hypertext system, and the interface it presents to its hypertext facilities is cumbersome, particularly for authoring links. However, by customising it using the built-in programming language it is possible to make an adequate hypertext user interface.

A1.12 Microsoft Help

Microsoft's `Help' program is a cut-down (read-only) version of its word-processor (above). Documents are prepared with Microsoft Word according to a strict structure, and are then `compiled' into a form which can be used by the help program. It is distributed (free) with Microsoft's Windows environment, and is used to provide online manuals for the operating system and application programs. It has one enhancement over the word-processor in that a path mechanism is provided, but the flexible and familiar `hypertext is a document' paradigm established by the word-processor is abandoned in favour of an interface which emphasises the traditional node and jump paradigm.

A1.13 Acrobat

Acrobat [21] is a commercial document distribution system (from Adobe) which has some primitive hypertext facilities. Acrobat documents are produced as the product of printing to a `virtual Acrobat printer' from within another application and so an Acrobat document may contain any kind of textual or graphical information that can be displayed on a colour raster printing device.

A `node' within Acrobat is equivalent to a printed page, and may be of arbitrary size but the information content of the node is fixed and may not be edited or manipulated in any way except for viewing. A `link' associates a source with a particular view (location and magnification) of a particular destination node. The link source is either a rectangular area of the node (which may be highlighted by outlining) or an entry in a special hierarchical list of bookmarks (usually used as a table of contents).

Links are first-class objects as they are stored explicitly as separate objects within each document, but no graphical browser is provided. Links can only be made to local nodes (within the same document). The nodes are arranged in a sequence, corresponding to the print order of the pages of original data, but a hierarchy can be imposed by the use of bookmarks. Note that there will not usually be a one-to-one mapping between the `logical' section names in the tree of bookmarks and the nodes in the document.

A1.14 Hyper-G

Hyper-G [95] is a hypermedia system which is explicitly based on the nodes and links model. The nodes are stored in distributed databases, and are labelled with sets of attributes (e.g. keywords, comments) which are used for accessing the nodes. The information is arranged hierarchically within each database and allows a default structure to be imposed on the nodes. Links are bidirectional in nature and held in separate link databases. The links are created automatically using information-retrieval techniques and applied to documents dynamically as they are displayed. Tools are also provided for maintaining the integrity and consistency of these link databases as the database of nodes is edited.

APPENDIX 2--TECHNICAL DETAILS

A2.1 Lace

Section 3.1 has described the user's view (both author and reader) of LACE. We will now explain in more detail the implementation of the system.

A2.1.1 The Document Server

The job of the LACE document server is to preside over a database of documents, listening over the network for requests for fragments of those documents. Once a request has selected a particular document, the server must be able to dissect that document into its logical substructure and return the part which the client requested.

At the centre of the LACE server's world, then, is a database which describes all the documents which have been published on this node. Each document is in turn a database of logical elements, text and links.

A2.1.1.1 Database Structure

The files that make up the document database are held in the directory /usr/lace/db; the textual source of the database is called docs. This file is constructed similarly to the UNIX password file, i.e. each line is a record and the fields are separated by colons. A typical line might look like that in figure A2.1.

The first and second fields define a nickname and full name for the document, respectively. Either can be used for referring to the document over the network. The third field (currently unused) specifies the permissions associated with this document and the fourth (also unused at this time) gives a comma-separated list of keywords that describe the contents of this document. The fifth field gives the name of the directory in which the document is to be found, the sixth gives the name of the file in that directory. The last field is the type of the document, specifying which medium it is on (e.g. video) or which markup system has been used to represent it (e.g. TEX or troff).
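As an illustration, a record with this layout can be unpacked as follows (a hypothetical Python sketch, not part of LACE itself; the dictionary keys are my own names for the fields described above):

```python
def parse_doc_record(line):
    # Split one colon-separated record of the docs file into its
    # seven fields, as described in the text above.
    (nickname, fullname, permissions,
     keywords, directory, filename, doctype) = line.rstrip("\n").split(":")
    return {
        "nickname": nickname,             # short name for network requests
        "fullname": fullname,             # full document name
        "permissions": permissions,       # currently unused
        "keywords": keywords.split(","),  # currently unused
        "directory": directory,           # directory holding the document
        "filename": filename,             # file within that directory
        "type": doctype,                  # medium or markup system
    }

# The example entry from figure A2.1:
record = parse_doc_record(
    "test:Humanities Computing:public:humanities,archaeology:/usr/lace:foo.tex:latex")
```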

The command updoc puts this database into the UNIX dbm format, in a database called `docs.byname', which is the object that the document server actually deals with.

A2.1.1.2 Server Protocols

The server listens on a well-known tcp port (doc/6442) for requests for document fragments. The format of these requests is shown in figure A2.2, so that atest:section 1, nArchaeology Today:Introduction or oWar and Peace are all valid server requests.

test:Humanities Computing:public:humanities,archaeology:/usr/lace:foo.tex:latex

Figure A2.1: An entry from the Lace database

<Request> ::= <RequestType><DocumentSpec>
<RequestType> ::= a | n | o
<DocumentSpec> ::= <DocumentName> | <DocumentName>:<SubdocSpec>
<SubdocSpec> ::= <Subdoc Type><Reference> | <Title>
<Subdoc Type> ::= page | chapter | section | table | figure | footnote ...
<Reference> ::= <Title> | <Number>
<Title> ::= [a-zA-Z.,:;?!"`()]
<Number> ::= <Digits> | <Digits> . <Number>
<Digits> ::= [ 0-9 ]

Figure A2.2: Server Request Protocol

Request type n asks for a named document fragment to be displayed in a new window on the client's NEWS server. This is similar to the o request which overwrites the contents of a window with a named document fragment. Request type a makes the server add an annotation to the named document fragment. The user is prompted for a title and the body of the annotation which is then added to the annotation file associated with the document. Links to and from the annotation are inserted into the document's link file.

All of these requests involve asking for a document by name. The name matching is performed as follows: first the document name is checked against the list of nicknames and then against the list of full document names. The test is case-insensitive and compresses all sequences of multiple blanks to one blank. A similar operation is performed when matching a subdocument title to the requested title, except that any title which has the requested title as a prefix will satisfy the match (i.e. the request nfoo:intro will be matched by the Introduction of document foo).
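These matching rules can be sketched in Python (a hypothetical illustration, not LACE code; the function names are mine):

```python
import re

def normalise(name):
    # Matching is case-insensitive and compresses runs of blanks to one.
    return re.sub(" +", " ", name.strip().lower())

def match_document(requested, nicknames, fullnames):
    # Nicknames are checked before full document names.
    want = normalise(requested)
    for name in nicknames + fullnames:
        if normalise(name) == want:
            return name
    return None

def match_subdoc(requested, titles):
    # Any title with the requested title as a prefix satisfies the match.
    want = normalise(requested)
    for title in titles:
        if normalise(title).startswith(want):
            return title
    return None
```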

The special name me is recognised as referring to `the current document', so that a document may send a request for me:section 2 to display section 2 of itself.

Any request which consists only of a document name will notionally have the subpart :page 1 appended to it. Pages are physical structures created by the typesetting software, rather than explicit logical structures provided by the author.
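Putting the grammar of figure A2.2 together with this default, a request can be decomposed roughly as follows (a hypothetical Python sketch, not part of the LACE server):

```python
def parse_request(request):
    # <Request> ::= <RequestType><DocumentSpec>, where the type is
    # a (annotate), n (new window) or o (overwrite window).
    rtype, spec = request[0], request[1:]
    if rtype not in ("a", "n", "o"):
        raise ValueError("unknown request type: " + rtype)
    if ":" in spec:
        document, subdoc = spec.split(":", 1)
    else:
        # A bare document name notionally gets :page 1 appended.
        document, subdoc = spec, "page 1"
    return rtype, document, subdoc
```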

A2.1.1.3 Document Format

Every document is considered to be a database of textual elements and link references; however, the current version of LACE implements them quite differently. Only requests for the POSTSCRIPT representation of a document are currently supported, so the document is stored as a .ps file (along with the original source file). Separate files are used to store a logical map of the document and a list of the internal and external links that it defines.

Hence, the lace server considers a document to be stored in a file whose name is given in the document database, except that any extension that the filename has is ignored and a .ps extension is appended. This .ps file consists of two elements: a setup procedure which typically defines the set of fonts that the document uses, and a list of pages where each ``page'' is a procedure which contains the instructions to print one page. Each document element in the page array is marked by POSTSCRIPT comments, as can be seen in figure A2.3.

%%Document Setup {
/c-med.240 /Courier 33.208800 TeXPSmakefont def
/h-med.270 /Helvetica 37.359900 TeXPSmakefont def
/t-bol.300 /Times-Bold 41.511000 TeXPSmakefont def
/cmr10.300 /cmr 41.511000 TeXPSmakefont def }
%%Document Setup End
%%Page List
%%Page: 1
{ 1 @bop1
%LACE mark section 1 begin Introduction
t-bol.360 @sf 141 1355 p 49 c 216 1355 p (Intr) s 302 1355 p (oducti) s 435 1355 p (on) s t-rom.300 @sf 141 1442 p (This) s 231 1442 p (report) s 345 1442 p (describes) s 516 1442 p (the) s ...
%LACE mark section 1 end
141 2324 p
%LACE mark section 2 begin Courses
t-bol.360 @sf 141 2373 p 50 c {
%LACE mark footnote 1 begin
cmr6.300 @sf 1641 2484 p 49 c t-rom.240 @sf 1658 2495 p 73 c 1683 2495 p (am) s 1738 2495 p (grateful) s 1856 2495 p (to) s 1896 2495 p (Lou) s 1964 2495 p (Burnard) s
%LACE mark footnote 1 end
} pop
t-rom.300 @sf 1611 2517 p (and) s 1684 2517 p (two) s 141 2566 p (one-term) s 304 2566 p (options) s
}

Figure A2.3: A Lace Document in PostScript Form

As well as the .ps file, there is a map file which catalogues the positions of all the logical document structures in the .ps file (as shown in figure A2.4).

Each line in the file represents one structure in the document. The first and second fields are the byte offsets for the beginning and end of that structure from the start of the .ps file, the third and fourth fields represent the page numbers on which the structure starts and ends, the fifth field is the type of the structure (e.g. page, section, table), the sixth is the ordinal number of that structure (as in table 2 or subsection 2.3) as provided by the formatter, and the remainder of the line is the title of that structure. The name of the map file is the same as the .ps file, with the extension `.map' added to the existing `.ps' extension.

991 2031 0 0 page 0
2031 19591 1 1 page 1
19591 32923 2 2 page 2
13 957 0 0 special 1 Document Setup
12429 17273 1 1 section 1 Introduction
17310 86771 1 6 section 2 Courses
18153 18770 1 1 footnote 1
53045 56597 4 4 subsection 2.1 Equipment
67177 72080 5 5 figure 1 Questionaire given to students
72155 75018 5 5 table 1 Results for all groups of students
82668 86771 5 6 subsection 2.2 Conclusion on courses
@/usr/lace/cstr86-2.ann 0 354 1 1 annotation 1 Three Years On

Figure A2.4: A Map File

An extension to this basic format has been made to allow annotations to be linked in without altering the original document. This can be seen in the final line of the figure, where an extra field (signified by the @ character) has been prepended. This extra field gives the name of a separate file to which the rest of the line refers. Hence the last line in figure A2.4 says that the first annotation (titled ``Three Years On'') to this document is to be found between bytes 0 and 354 in the file called /usr/lace/cstr86-2.ann.
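A map line, including the annotation extension, can be decoded along these lines (a hypothetical Python sketch; the tuple layout is my own):

```python
def parse_map_line(line):
    # Annotation lines carry an extra leading @file field; for ordinary
    # lines the byte offsets refer to the document's own .ps file.
    annotation_file = None
    if line.startswith("@"):
        annotation_file, line = line[1:].split(" ", 1)
    fields = line.split(None, 5)
    start, end = int(fields[0]), int(fields[1])           # byte offsets
    first_page, last_page = int(fields[2]), int(fields[3])
    structure_type = fields[4]                            # page, section, ...
    number_and_title = fields[5].split(None, 1)
    number = number_and_title[0]                          # e.g. 2.1
    title = number_and_title[1] if len(number_and_title) > 1 else ""
    return (annotation_file, start, end, first_page, last_page,
            structure_type, number, title)
```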

As well as the .ps and .map files, there is a link file associated with the document. This link file lists all the references that exist between the document structures. The example shown in figure A2.5 shows that the introductory section of this document makes reference to section 2, and that annotation 1 is commenting upon section 5. This file is scanned by the document server to provide a menu of `Come Froms' and `GoTos' that relate to the current structure.
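The derivation of these menus might be sketched as follows (a hypothetical Python illustration; in particular, the rule for splitting a line into its two addresses, at the last occurrence of ` me:', is an assumption that holds for the example in figure A2.5 but is not stated in the text):

```python
def link_menus(link_lines, current):
    # Each line names a source part and a destination part. Links whose
    # source is the current structure become `GoTos'; links whose
    # destination is the current structure become `Come Froms'.
    gotos, come_froms = [], []
    for line in link_lines:
        source, rest = line.rsplit(" me:", 1)  # assumed split rule
        destination = "me:" + rest
        if source == current:
            gotos.append(destination)
        if destination == current:
            come_froms.append(source)
    return come_froms, gotos

# Two lines from the example links file:
links = ["me:section Introduction me:section 2",
         "me:annotation 1 me:section 5"]
```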

A2.1.1.4 Non-textual Documents

The preceding discussion of document structure applies both to textual documents and to documents of a different medium. Videodisc documents, for example, have both the map and links files, but no .ps file. Videodiscs have a logical structure associated with them since they are manufactured with divisions into chapters which are analogous to the `tracks' on an LP. Each chapter can be subdivided into different sections of related material. This structure is entered into the map file, where the byte offsets are interpreted as frame numbers and the page offsets are ignored. The links file can hold the logical relationships and references between different parts of the videodisc. Work in this area is not very advanced as yet: how to relate buttons and menus to a non-overlaid video display is one of the main questions that must be faced.

me:section Introduction me:section 2
me:section Humanities Computing me:footnote 1
me:section Humanities Computing me:section 4
me:annotation 1 me:section 5

Figure A2.5: A Links File

A2.1.2 Authoring Environment

LACE allows authors to work in their normal environment, with their own spelling checkers, cross-referencing tools, bibliographic databases and the like. The LACE philosophy is to intercept the document at the formatting stage, changing the definitions of the high-level macros to produce the effects that the browsing environment requires. This will often mean injecting `markers' into the formatted document to retain the high-level structural information, or changing the way that the system presents indexes or tables and figures. Generally the result of the formatting stage is a file of POSTSCRIPT to be sent to a printer; this is captured and post-processed by LACE to conform to the structure set out above in section A2.1.1.3. This is all accomplished by `hyper' versions of the formatters (hyperlatex and hypertroff) which preprocess the document source before handing it on to the formatter and then the postprocessor. Once this has been done, other programs are invoked to make the map file (mkmap) and the links file (mklinks).

A2.1.2.1 LaTeX Authoring

Hyperlatex adds the lace documentstyle option before invoking LATEX itself. This makes the following changes to the standard LATEX environment:

The following document structuring commands are modified to add markers to the POSTSCRIPT, so that LACE can pick them out from the formatted output: abstract, chapter, section, subsection, subsubsection, table, figure, footnote, bibliography, aside

The definitions of table, figure and footnote have been changed so that they take up no space on the page, but instead inhabit a separate window that is brought up when pressing a button over a reference to them. For example, a footnote window is brought up when the reader presses the button over the footnote marker in the main body of the text. Similarly, a figure window is displayed when the reader presses a button over some text like `see figure 3', which will typically have been created by the new LACElabel and LACEref commands

A new command link is defined. This takes two parameters--the first is a piece of text over which a button will be placed, the second is the LACE address of a document part. For example, the command \link{see section 3}{me:section 3} will make a new window with section 3 of this document pop up when the user clicks on the (invisible) button over the text `see section 3'

A new environment aside is defined. This behaves very much like a footnote: the LATEX fragment

\begin{aside}{Click Me For More Information}

This project has been funded by the World Wildlife Appeal

\end{aside}

will produce an invisible button over the words ``Click Me For More Information'' which, when pressed, will bring up a window with the text that is in the body of the environment.

The label and ref commands have been extended in the form of the LACElabel and LACEref pair. These two commands differ from the standard LATEX forms in that they save not only the number of the current environment, but also its type e.g. section 3.4 or table 2. As well as doing this, the reference text has a link to the item it references.

The table of contents, list of figures and list of tables all have invisible buttons on each of the lines, so that clicking on any line will bring up a window with that part of the document in it.

Any use of the cite command to produce a bibliographic citation in the main body of the text has a button over it that brings up a window containing the full bibliography entry. No generalised facilities yet exist for bringing up the cited document, even if it is a published LACE document.

After invoking hyperlatex, use hyperdvi to produce the .ps file. Then use mkmap and mklinks to produce the auxiliary files.

A2.1.2.2 Man Authoring

Hyperman invokes troff with a special set of macros called manh (derived from the standard man macros). These macros perform the following extra functions:

Define a new command lN which is similar to LATEX's link command. The first parameter is the piece of text which is to have a button appearing over it, the second parameter is a LACE document part which will appear in a new subwindow when the link is activated.

The section command .SH has been modified to add markers to the .ps file, so that LACE can pick out the synopsis, usage description, and bugs sections from the formatted output.

The preprocessing stage of hyperman looks for strings of the form ...cat(1)... and turns these into buttons that are linked to the appropriate manual pages on the assumption that the manual page for foo in section N of the manual is published under the nickname ``foo(N)''.
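The scan for manual-page references could look something like this (a hypothetical Python sketch; hyperman itself is a troff preprocessor, and the regular expression is my guess at the pattern):

```python
import re

# Strings such as "cat(1)" become links to the manual page published
# under the nickname "cat(1)".
MAN_REF = re.compile(r"\b([a-z0-9_.-]+)\((\d)\)")

def manual_page_nicknames(text):
    # Return the LACE nicknames for every foo(N) reference in the text.
    return ["%s(%s)" % (name, section)
            for name, section in MAN_REF.findall(text)]
```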

There is a major problem with troff in that any `specials' that are pushed through the formatting stage using the \! notation are floated to the start of the line. This means that it is difficult to mark out the beginning and end of the active area of a button. The current way around this is to put a special null-length marker string into the text stream. For example, to have a button over the word `foo' which links the user to section 3 of document `bar', the troff sequence \\kxfoo\\ky\\h'|\\nxu'BUTTON:foo:bar:section 3\\h'|\\nyu' would be used. This puts the horizontal positions of the beginning and end of the text `foo' into registers x and y respectively; troff then backs up to the beginning of the text, writes the string `BUTTON:' followed by the text again, followed by the LACE document part name `bar:section 3', and then skips to the endpoint of the original text. The process which post-processes the .ps file takes the position of the start of the button to be the current position when the string `BUTTON' is found, and the width of the button to be the width of the string delimited by the next two colons.
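The post-processor's treatment of these markers can be illustrated as follows (a hypothetical Python sketch of the parsing step only; the real post-processor works on POSTSCRIPT positions, which are not modelled here):

```python
def parse_button_marker(marker):
    # "BUTTON:foo:bar:section 3" carries the button text between the
    # first two colons and the LACE address of the link destination
    # after the second colon.
    tag, text, address = marker.split(":", 2)
    if tag != "BUTTON":
        raise ValueError("not a button marker")
    return text, address
```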

A2.1.2.3 Web Authoring

Hyperweave invokes weave on the WEB file, and then TEX on the result, but using a variant form of the webman macros which do the following:

Each section definition is marked so that the postprocessor has the high-level structural information along with the low-level formatted text

Every reference to another section (e.g.``...this code is used in section 5'' or <Store all the reserved words 64>) has a button attached to it that will bring up that section in a new window.

The index has buttons over the references to each section number

A2.1.3 Browsing Environment

LACE's browsing environment is built on top of the NEWS windowing system. All LACE-specific interactions are defined by the LACEwindow class, which knows how to print TEX and troff documents, as well as providing hypertext buttons and communication with the LACE server to implement the hypertext links.

LACEwindow is a subclass of the system-defined LiteWindow class, which has a number of extra attributes:

Prolog the POSTSCRIPT definitions required to display a particular sort of document i.e. slightly modified versions of the TEX or troff prologs that would be sent to a laser printer. The prolog will vary according to the document type

Title the full title of this document (displayed in the window's title stripe)

Setup the POSTSCRIPT procedure that sets up the environment for this document. Usually defines the set of fonts to be used. Notice that the prolog is specific to a particular document type, whereas the setup is specific to a particular document.

Pages the array of contiguous pages that is currently known by the window. Each page is simply a procedure which when executed will render the page onto the screen.

MaxPage the number of the last page currently held in memory

MinPage the number of the first page currently held in memory

RealMaxPage the number of the last page of the document (usually `0' for the title page)

RealMinPage the number of the first page of the document

Menus the menu object that is displayed when the reader presses the mouse's menu button inside a LACE window.

The POSTSCRIPT code shown in figure A2.6 is a typical creation of a new LACEwindow.

%%Document Title
(Humanities Teaching)
%%First & Last Pages
0 7
%%First of This Page Set
1
%%Document Setup
{
/ag-book.360 /AvantGarde-Book 49.813200 TeXPSmakefont def
/p-bol.300 /Palatino-Bold 41.511000 TeXPSmakefont def/p-ita.240
/Palatino-Italic 33.208800 TeXPSmakefont def
/cmmi10.300 /cmmi 41.511000 TeXPSmakefont def
/cmr6.300 /cmr 24.906600 TeXPSmakefont def
/cmsy10.300 /cmsy 41.511000 TeXPSmakefont def
}
%%Menus
[ (Contents =>)
[ (1 Introduction) {(me:page 1) /doLink win send}
(2 Humanities Computing) {(me:page 2) /doLink win send}
(3 Undergraduate Courses) {(me:page 2) /doLink win send}
(3.1 1985-86 course) {(me:page 2) /doLink win send}
(3.2 Student reaction) {(me:page 4) /doLink win send}
] /new DefaultMenu send
(Tables =>)
[ (1 Computer Usage) {(me:table Computer Usage) /doLink win send}
(2 Specific hardware) {(me:table Specific hardware) /doLink win send}
] /new DefaultMenu send
] /new DefaultMenu send
%%Page List
[
%%Page: 1
{1 @bop1p-romsc.300 @sf141 -54 p
(LIST) s241 -54 p (OF) s311 -54 p 84 c334 -54 p (ABLES) sp-rom.300 @sf
...}
]
/new LACETeXWindow send

Figure A2.6: Creating a New Lace Window

The parameters given to a new window are, in order: the document title, the minimum and maximum page numbers of the document, the page number of the first page of the current batch, the document's setup procedure, the menus and the set of pages in the current batch.

The window may know about several pages at a time, because a particular document structure may span several physical pages. This is the reason for having a MaxPage and a RealMaxPage. If the reader requests a new page, the window first checks to see if it is in the current set of pages before passing on a request to the LACE server. This two-level storage is used to improve the response of the system.
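The two-level lookup can be sketched like this (a hypothetical Python illustration of the idea; the real windows hold POSTSCRIPT page procedures and talk to the LACE server, which is stood in for here by a fetch function):

```python
class PageCache:
    def __init__(self, first_page, pages, fetch):
        self.first_page = first_page   # number of pages[0]
        self.pages = list(pages)       # contiguous batch held in memory
        self.fetch = fetch             # stand-in for a LACE server request

    def goto_page(self, n):
        # Serve the page locally if it is in the current batch,
        # otherwise ask the "server" for a new batch containing it.
        if not (self.first_page <= n < self.first_page + len(self.pages)):
            self.first_page, self.pages = self.fetch(n)
        return self.pages[n - self.first_page]
```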

A careful look at the last line of figure A2.6 shows that the current implementation actually uses different classes for each type of document, i.e. a LACETeXWindow class for TEX documents and a LACEtroffWindow for troff documents. These classes are used for efficiency, as they already have the (not insignificant) prologs elaborated in their environments.

The methods that the LACEwindow classes respond to are as follows:

/NextPage go to the next page of the document. If the next page is not in the page array, dispatch a request to the LACE server

/PrevPage go to the previous page of the document. If the previous page is not in the page array, dispatch a request to the LACE server

/GotoPage go to the page whose number is passed as a parameter, possibly sending a request to the LACE server.

/RecentPage backtrack through the list of last-read pages. Repeatedly issuing this message will return the reader to the `original' page

/CurrentPage returns the number of this page

/newpages takes two parameters: a new page array and the number of the first page in the array. This message is sent back to the window by the LACE server in response to an `o' request.

/dest destroy this window

/doLink takes a LACE address and sends a request to the LACE server to fire up a new window with those contents.

/AddToTrail add this document part to the current trail

/abortLACEserver kill the LACE server which is presiding over this window

A2.2 Lace '92

Lace '92 is implemented by a pair of HyperCard stacks. The major scripts which control the user interaction and data handling are reproduced here.

A2.2.1 Lace '92 WAIS Information Gatherer

All the information retrieval is performed through the WAIS protocol. In order to simplify the communications requirements, this stack simply implements a `remote control keyboard' to send commands to the WAIS programs running on a UNIX host across the network. The program being run is `waisexsearch', a standard UNIX WAIS client which has been modified to return WAIS document references as well as the retrieved text.

on openStack
  global connectionID, source, question
  push card
  go to first cd
  put fld "Source" into source
  put fld "Question" into question
  makeMenus
  lock screen
  put "Starting connection..."
  set the cursor to watch
  push card
  go to card "Comms Console"
  put empty into fld "CommsLog"
  put false into tickle
  put TCPNameToAddr("152.78.64.8") into theHost
  put TCPActiveOpen(theHost, 23, 0) into connectionID
  if connectionID contains "fail" then
    --put "The Result:" && connectionID
    beep
    answer "Cannot contact Wynkyn on Ethernet" with "OK"
    hide message box
    put empty into connectionID
    pop card
    exit openStack
  end if
  wait until TCPState(connectionID) is "established"
  wait until TCPCharsAvailable(connectionID) > 0
  gobble
  TCPSend connectionID, numToChar(255)&numToChar(252)&numToChar(24)
  repeat
    get TCPrecvupto(connectionID, return, 2, empty)
    if it contains "login:" then exit repeat
  end repeat
  TCPSend connectionID, "none"&return
  wait for 2 seconds
  TCPSend connectionID, "nowayhose"&return
  repeat
    put TCPrecvupto(connectionID, return, 2, empty) into str
    set the cursor to busy
    if prompt(str) then exit repeat
    put str after fld "CommsLog"
  end repeat
  pop card
  pop card
  hide message box
end openStack

on closeStack

global connectionID

put "Closing connection..."

-- TCPSend connectionId, "logout"&return

TCPClose connectionID

-- wait until TCPState(connectionID) is "closed"

TCPRelease connectionID

put empty into connectionID

hide message box

end closeStack

on gobble

global connectionId

get TCPRecvChars(connectionID,TCPCharsAvailable(connectionID))

put it after fld "CommsLog"

end gobble

function prompt str

if str contains "§§§" then TCPFAIL

put length(str) into max

return char max-1 to max of str is "$ "

end prompt

function doCommand theCmd

global connectionId

TCPSend connectionId, theCmd&return

put empty into theRes

repeat

put TCPrecvupto(connectionID, return, 2, empty) into str

set the cursor to busy

if prompt(str) then exit repeat

put str after theRes

end repeat

--delete line 1 of theRes

delete line 1 of theRes

put superStrip(theRes, lineFeed) into theRes

return theRes

end doCommand

function interpret id

if id is empty then return "no connection"

else return TCPState(id)

end interpret

on makeMenus

reset menubar

create menu "Lace-92"

put "Wais Query,Shell,Interrupt,-,New Partition,-,Export Structure" after menu "Lace-92" with~

menumessages "doWQ,doShell,doIntr,,makeAClass,,exportToFile"

end makeMenus

function wprompt str

if str contains "§§§" then TCPFAIL

if last char of str is space then delete last char of str

put length(str) into max

return char max-5 to max of str is "quit]:"

end wprompt

on waisSearch sources, words, remote

global connectionId, tickle, numArts, wState

if wstate is "connected" then

waisExit

put "disconnected" into wstate

end if

if remote is empty then

put "/usr/lib/wais/bin/waisexsearch -h wynkyn -p 210 -d"&&word 1 of sources&&words into theCmd

else

put word 1 of sources into db

if db is "cacm" then put "/proj/wais/db/cacm/cacm" into db

put word 3 of sources into host

--put "/usr/lib/wais/bin/wtel@tex"&&db&&host&&words into theCmd

put "/usr/lib/wais/bin/waisexsearch -h"&&host&&"-p 210 -d"&&db&&words into theCmd

end if

lock screen

push card

go to card "Comms Console"

TCPSend connectionId, theCmd&return

put empty into theRes

put 0 into numArts

put empty into buffer

repeat

put TCPCharsAvailable(connectionId) into n

put TCPRecvChars(connectionID,n) after buffer

put length(buffer) into m

put buffer into str

if char m-1 to m of buffer is return&linefeed then

put empty into buffer

else

put number of lines of buffer into lb

if lb > 1 then

delete line 1 to lb-1 of buffer

end if

delete last line of str

end if

if str is empty and wprompt(buffer) then

exit repeat

end if

set the cursor to busy

if str is empty then next repeat

if wprompt(str) then exit repeat

put str after theRes

put str after fld "CommsLog"

select after last char of fld "CommsLog"

if numArts > 0 then

get word 2 of str

delete last char of it

put "Getting headline of article"&&it&"/"&numArts

end if

if str contains "+++- Spad" then put "Calling UK gateway"

if str contains "++ x25 server closed connection" then

put "JANET call failed"

beep

pop card

exit waisSearch

end if

if str contains " +++- bytes/pkts" then

put "search failed"

beep

pop card

exit waisSearch

end if

if str contains "Connected to" then put "Connected to UK gateway"

if str contains "waissearch" then put "Issuing information search"

if str contains "waisexsearch" then put "Issuing information search"

if str contains "SunOS Release" then put "Logged on to UK gateway"

if str contains "NumberOfRecordsReturned:" then

put offset("NumberOfRecordsReturned:",str) into pos

get char pos to 30000 of str

put "Found"&&word 2 of it&&"relevant articles"

--put word 2 of str into numArts

end if

if numArts > 0 then

get word 2 of str

delete last char of it

put "Getting headline of article"&&it&"/"&numArts

end if

end repeat

hide message box

put true into tickle

--delete line 1 of theRes

delete line 1 of theRes

put superStrip(theRes, lineFeed) into theRes

pop card

put offset("Search Response:", theRes) into pos

if pos is 0 then

beep

else

delete char 1 to pos of theRes

delete first line of theRes

end if

put theRes into fld "Results"

put return&"quit: Finish" after fld "Results"

put empty into fld "Text"

put "connected" into wstate

-- waisExit

end waisSearch

on waisExit

global connectionId

TCPSend connectionId, "0"&return

put TCPrecvupto(connectionID, return, 10, empty) into str

TCPSend connectionId, "q"&return

put TCPrecvupto(connectionID, return, 10, empty) into str

repeat

put TCPrecvupto(connectionID, return, 2, empty) into str

set the cursor to busy

if prompt(str) then exit repeat

end repeat

end waisExit

on doIntr

global connectionId

TCPSend connectionId, numToChar(3)&return

end doIntr

on doShell

global connectionID

ask "What command?" with "date"

if it is empty then exit doShell

answer doCommand(it)

end doShell

on sink

global connectionId

put empty into theRes

repeat

put TCPrecvupto(connectionID, return, 2, empty) into str

set the cursor to busy

if prompt(str) or wprompt(str) then exit repeat

if str != empty then put str

end repeat

put empty

hide message

end sink

function TCPgetLine

global connectionId

put empty into str

repeat

put TCPrecvupto(connectionID, return, 2, empty) after str

if str contains "§§§" then TCPFAIL

if last char of str is return then exit repeat

end repeat

return str

end TCPgetLine

on TCPFAIL

beep

answer "TCP driver failed. Connection aborted" with "Damn!"

exit to hyperCard

end TCPFAIL

on waisSelect n, dname

put offset(space&n&":",fld "Results") into pos

if pos is 0 then

beep

exit waisSelect

end if

get char pos to pos+1000 of fld "Results"

get first line of it

put word 4 of it into foo

delete char 1 to 6 of foo

if foo is empty then

put word 5 of it into nlines

put word 6 to 100 of it into title

else

put foo into nlines

put word 5 to 100 of it into title

end if

if dname is empty then

WAISchoice n, title, nlines

else

WAISchoice n, title, nlines, dname&":"&superstrip(char 1 to 26 of title,":")

end if

global waisID

put getWaisID() into waisID

end waisSelect

function getwaisID

global connectionId, source

if last word of source is "ecs.soton.ac.uk" then

TCPSend connectionId, "i"&&"/home/wynkyn/2/users/lac/tmp/FOO.ID"&return

else

TCPSend connectionId, "i"&return

end if

put empty into theRes

repeat

put TCPrecvupto(connectionID, return, 2, empty) into str

set the cursor to busy

if wprompt(str) then exit repeat

put str after theRes

end repeat

if last word of source is "ecs.soton.ac.uk" then

put "UNIX_temporary:FOO.ID" into fname

open file fname

read from file fname until eof

close file fname

put it into theRes

else

delete line 1 of theRes

put superStrip(theRes, lineFeed) into theRes

end if

return theRes

end getwaisID

on doWQ

global connectionID, source, question

click at -10,-10

if last word of source is "ecs.soton.ac.uk" then

put empty into remote

else

put true into remote

end if

waisSearch source, question, remote

end doWQ

on startProgress

show cd fld "Backdrop"

show btn "Item Backdrop"

show btn "Group Backdrop"

set the width of btn "Item" to 0

set the left of btn "Item" to the left of btn "Item Backdrop"

show btn "Item"

set the width of btn "Group" to 0

set the left of btn "Group" to the left of btn "Group Backdrop"

show btn "Group"

end startProgress

on progress i, imax, g, gmax

put round(min(i/imax,1.0) * the width of btn "Item Backdrop") into wid

set the rect of btn "Item" to (the left of btn "Item"), the top of btn "Item", (the left of btn "Item") + wid, the bottom of btn "Item"

put round(min(g/gmax,1.0) * the width of btn "Group Backdrop") into wid

set the rect of btn "Group" to (the left of btn "Group"), the top of btn "Group", (the left of btn "Group") + wid, the bottom of btn "Group"

end progress

on endProgress

hide cd fld "Backdrop"

hide btn "Item Backdrop"

hide btn "Group Backdrop"

set the width of btn "Item" to 0

hide btn "Item"

set the width of btn "Group" to 0

hide btn "Group"

end endProgress

on waisChoice n, title, nlines, fileIt

global connectionId, numArts, numLines, source

if last word of source is "ecs.soton.ac.uk" then

TCPSend connectionId, n&&"/home/wynkyn/2/users/lac/tmp/FOO.TMP"&return

else

TCPSend connectionId, n&return

startProgress

end if

get TCPgetLine()

put empty into theRes

put empty into fld "Text"

put 0 into l

put empty into buffer

repeat

put TCPCharsAvailable(connectionId) into n

put TCPRecvChars(connectionID,n) after buffer

put length(buffer) into m

put buffer into str

if char m-1 to m of buffer is return&linefeed then

put empty into buffer

else

put number of lines of buffer into lb

if lb > 1 then

delete line 1 to lb-1 of buffer

end if

delete last line of str

end if

if str is empty and wprompt(buffer) then

exit repeat

else if str is empty and buffer is (linefeed & "$ ") then

put "ERROR: crashed back to local host"

endProgress

beep

--pop card

exit waisChoice

end if

set the cursor to busy

if str is empty then next repeat

if wprompt(str) then exit repeat

put str after theRes

-- put str after fld "CommsLog"

-- select after last char of fld "CommsLog"

add number of lines of str to l

if fileIt is empty then

progress l,nlines, l,nlines

else

progress l,nlines, 1,numArts

end if

if str contains "++ x25 server closed connection" then

put "JANET call failed"

endProgress

beep

--pop card

exit waisChoice

else if str contains "+++- bytes/pkts" then

put "nfs.tn connection timed out"

endProgress

beep

--pop card

exit waisChoice

else if length(fld "Text") < 4000 then

put superStrip(str, lineFeed) after fld "Text"

end if

end repeat

if last word of source is "ecs.soton.ac.uk" then

put "UNIX_temporary:FOO.TMP" into fname

put empty into theRes

open file fname

repeat

read from file fname for 16000

if it is empty then exit repeat

put it after theRes

end repeat

close file fname

else

endProgress

--pop card

--delete line 1 of theRes

end if

put superStrip(theRes, lineFeed) into theRes

if fileIt is empty then

if length(theRes)>30000 then

put char 1 to 30000 of theRes into fld "Text"

answer "Only first 30K of article will be displayed. Store whole article in a file?" with "No" or "OK"

if it is "OK" then

ask file "Store article where?"

if it != empty then

put it into fname

open file fname

repeat

write char 1 to 10000 of theRes to file fname

delete char 1 to 10000 of theRes

if theRes is empty then exit repeat

end repeat

close file fname

end if

end if

else

put theRes into fld "Text"

end if

else

put fileIt into fname

open file fname

repeat

write char 1 to 10000 of theRes to file fname

delete char 1 to 10000 of theRes

if theRes is empty then exit repeat

end repeat

close file fname

end if

end waisChoice

A2.2.2 Lace '92 Information Organiser

The scripts in this stack perform three main tasks: (i) to manipulate the fields of text, allowing them to be moved, resized and reformatted; (ii) to deal with the various partitions (classes) of information, maintaining separate cards to display them; and (iii) to save the information to disk in LaTeX or SGML format.
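The SGML output written by the `exportToFile` handler below wraps each field in a <quote> element whose header attribute holds the field's first line and whose link attribute records the WAIS source. The shape of one such element can be sketched in C; `format_quote` is an illustrative name, not part of the stack.

```c
#include <stdio.h>

/* Build one LACE <quote> element in the style emitted by exportToFile:
   the field's first line becomes the header, and the link attribute
   records the WAIS source the text came from. */
int format_quote(char *buf, size_t buflen, const char *header,
                 const char *source, const char *body)
{
    return snprintf(buf, buflen,
                    "<quote header=\"%s\" link=\"WAIS(%s)\">\n%s\n</quote>\n",
                    header, source, body);
}
```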

on Z

global passKeys

put false into passKeys

end Z

on keyDown which

global passKeys

if passKeys is true then

pass keydown

exit keyDown

end if

if the target contains "card" then

if which is numToChar(8) then put "del" into act

else if which is "+" then put "bigger" into act

else if which is "-" then put "smaller" into act

else if which is "=" then put "normal" into act

else if which is "b" then put "bold" into act

else if which is "i" then put "italic" into act

else if which is "p" then put "plain" into act

else if which is "r" then put "reformat" into act

else if which is "s" then put "scroll" into act

else if which is "e" then put "edit" into act

else if which is "Z" then

put true into passKeys

exit KeyDown

else

beep

exit keyDown

end if

put number of cd flds into mc

repeat with d=2 to number of cd flds

put mc-d+2 into c

if the mouseLoc is within the rect of cd fld c then

if act is "del" then

set the cursor to watch

lock screen

choose field tool

select cd fld c

doMenu "Cut Field"

choose browse tool

unlock screen

else if act is "bigger" then

get the textSize of cd fld c

if it is 6 then get 7

else if it is 7 then get 8

else if it is 8 then get 9

else if it is 9 then get 10

else if it is 10 then get 12

else if it is 12 then get 14

else if it is 14 then get 18

else if it is 18 then get 24

else get it

set the textSize of cd fld c to it

else if act is "smaller" then

get the textSize of cd fld c

if it is 7 then get 6

else if it is 8 then get 7

else if it is 9 then get 8

else if it is 10 then get 9

else if it is 12 then get 10

else if it is 14 then get 12

else if it is 18 then get 14

else if it is 24 then get 18

else get it

set the textSize of cd fld c to it

else if act is "scroll" then

lock screen

if the style of cd fld c is "scrolling" then

put the rect of cd fld c into r

set the rect of cd fld c to item 1 of r, item 2 of r, (item 3 of r)-16, item 4 of r

set the style of cd fld c to rectangle

else

put the rect of cd fld c into r

set the rect of cd fld c to item 1 of r, item 2 of r, (item 3 of r)+16, item 4 of r

set the style of cd fld c to scrolling

end if

unlock screen

else if act is "reformat" then

lock screen

put reformat(cd fld c) into cd fld c

set the textStyle of line 1 of cd fld c to bold

unlock screen

else if act is "normal" then

set the textSize of cd fld c to 10

else if act is in "bold italic plain" then

set the textStyle of cd fld c to act

else if act is "edit" then

get the script of cd fld c

put line 2 of it into offs

delete line 1 to 3 of it

delete last line of it

global xx

put getDocFromID(it,1000) into xx

end if

exit keyDown

end if

end repeat

beep

else

pass keyDown

end if

end keyDown

function reformat s

put line 1 of s & return&return into XX

repeat with c=2 to number of lines of s

set the cursor to busy

if line c of s is empty then

if last line of XX != empty then

put return after XX

end if

else if first char of line c of s is space then

if last line of XX != empty then

put space after XX

end if

put word 1 to 999 of line c of s after XX

else if first char of line c of s is "\" then put return&line c of s &return after XX

else

if last line of XX != empty then

put space after XX

end if

put line c of s after XX

end if

end repeat

return XX

end reformat

on drawAClass c, r

--lock screen

put (the pattern)&&(the filled)&&(the lineSize)&&(the centered) into old

set the pattern to 1

set the filled to true

set the lineSize to 4

set the centered to false

choose select tool

drag from item 1 of r, (item 2 of r)-10 to item 3 of r, item 4 of r

doMenu "Clear Picture"

choose rectangle tool

drag from item 1 of r, item 2 of r to item 3 of r, item 4 of r

choose select tool

drag from item 1 of r, item 2 of r to item 3 of r, item 4 of r

repeat 20 times

doMenu "Darken"

end repeat

choose rectangle tool

set the centered to true

drag from round(((item 1 of r)+(item 3 of r))/2), item 2 of r to~

round(((item 1 of r)+(item 3 of r))/2)+60, item 2 of r+10

if there is a btn c then

select btn c

doMenu "Clear Button"

end if

choose Button Tool

doMenu "New Button"

set the name of btn "New Button" to c

set the style of btn c to transparent

set the rect of btn c to round(((item 1 of r)+(item 3 of r))/2)-60, (item 2 of r)-10,~

round(((item 1 of r)+(item 3 of r))/2)+60, item 2 of r+10

set the textAlign of btn c to center

set the textFont of btn c to helvetica

set the textSize of btn c to 14

set the textHeight of btn c to the textSize

set the textStyle of btn c to bold

set the autoHilite of btn c to true

choose browse tool

unlock screen

set the pattern to word 1 of old

set the filled to word 2 of old

set the lineSize to word 3 of old

set the centered to word 4 of old

end drawAClass

on makeAClass

global classlist, classRegs

ask "Name of class" with "nothing"

if it is empty or it is nothing then exit makeAClass

put it into newClass

if the short name of this cd is not "Overview"

then

put (the short name of this card)&"." before newClass

end if

put getRect() into newReg

put newClass & return after classList

put newReg&return after classRegs

put classList into fld "classList"

put classRegs into fld "classRegs"

drawAClass newClass, newReg

end makeAClass

function getRect

set the cursor to arrow

lock screen

doMenu "New Button"

put the number of btns into tmp

set the style of btn tmp to transparent

set the showName of btn tmp to false

set the width of btn tmp to 2

set the height of btn tmp to 2

hide btn tmp

unlock screen

wait until the mouse is down

set the topLeft of btn tmp to the clickLoc

show btn tmp

put item 1 of the clickLoc into sx

put item 2 of the clickLoc into sy

repeat until the mouse is up

set the rect of btn tmp to sx, sy, item 1 of the mouseLoc, item 2 of the mouseLoc

end repeat

get the rect of btn tmp

select btn tmp

doMenu "Clear Button"

choose browse tool

return it

end getRect

function membersofClass name

global classList, classRegs

repeat with c=1 to number of lines of classList

if line c of classList is name then exit repeat

end repeat

if line c of classList != name then return empty

put line c of classRegs into r

put empty into res

repeat with c=2 to number of cd flds

if the loc of cd fld c is within r then put c&space after res

end repeat

return res

end membersOfClass

function namesOf members

global classList

put empty into res

repeat with c=1 to number of words of members

put (the short name of cd fld (word c of members))&return after res

end repeat

return res

end namesOf

on mouseUp

if the target contains "card button" then

if the short name of this cd is "Overview" then

selectClass the short name of the target

else

go to card "Overview"

end if

else pass mouseUp

end mouseUp

on mouseDown

if "card field" is not in the target then exit mouseDown

global mdTick

if mdTick != empty then

if the ticks - mdTick < 20 then

doubleClick

exit mouseDown

end if

end if

put the ticks into mdTick

get the clickLoc

put item 1 of it into sx

put item 2 of it into sy

if (the right of the target - sx)<20 and (the bottom of the target - sy)<20 then put "size" into act

else put "move" into act

repeat until the mouse is up

get the mouseLoc

put item 1 of it into x

put item 2 of it into y

if x is sx and y is sy then next repeat

if act is "move" then

set the loc of the target to (item 1 of the loc of the target)+x-sx,~

(item 2 of the loc of the target)+y-sy

else

set the rect of the target to item 1 of the rect of the target,~

item 2 of the rect of the target,~

(item 3 of the rect of the target)+x-sx,~

(item 4 of the rect of the target)+y-sy

end if

put x into sx

put y into sy

end repeat

end mouseDown

on selectClass which

put namesOf(membersOfClass(which)) into m

answer m

zoomInOnClass which

end selectClass

on zoominonClass which

if there is a card which then go to card which

else

doMenu "New Card"

set the name of this card to which

doMenu "New Field"

hide cd fld 1

go back

put membersOfClass(which) into fnos

repeat with c=1 to number of words of fnos

select cd fld (word c of fnos)

doMenu "Copy Field"

go to card which

type V with commandKey, shiftKey

go back

end repeat

choose browse tool

go to cd which

put (the pattern)&&(the filled)&&(the lineSize)&&(the centered) into old

set the pattern to 1

set the filled to true

set the lineSize to 4

set the centered to false

choose rectangle tool

put the rect of the card window into r

drag from round(((item 1 of r)+(item 3 of r))/2)-120, item 2 of r+30 to~

round(((item 1 of r)+(item 3 of r))/2)+120, item 2 of r+60

choose Button Tool

doMenu "New Button"

set the name of btn "New Button" to which

set the style of btn which to transparent

set the rect of btn which to round(((item 1 of r)+(item 3 of r))/2)-120, item 2 of r+30,~

round(((item 1 of r)+(item 3 of r))/2)+120, item 2 of r+60

set the textAlign of btn which to center

set the textFont of btn which to helvetica

set the textSize of btn which to 24

set the textHeight of btn which to the textSize

set the textStyle of btn which to bold

set the autoHilite of btn which to true

set the pattern to word 1 of old

set the filled to word 2 of old

set the lineSize to word 3 of old

set the centered to word 4 of old

choose browse tool

end if

end zoomInOnClass

on exportToFile

global classList, source

ask file "Name the structured file"

if it is empty then exit exportToFile

put it into fname

set the cursor to watch

put "<document>"&return into foo

put empty into fs

repeat with c=1 to number of lines of classList

put line c of classList into cname

put "<section>"&cname&"</>"&return after foo

put membersOfClass(cname) into fnos

put fnos after fs

repeat with d= 1 to number of words of fnos

put "<quote header="&quote&line 1 of (cd fld (word d of fnos))&quote&&"link="&quote&"WAIS("&source&")"&quote&">"&return after foo

if line 2 of (cd fld (word d of fnos)) is empty then put line 3 to 30000 of (cd fld (word d of fnos)) after foo

else put line 2 to 30000 of (cd fld (word d of fnos)) after foo

put "</quote>"&return&return after foo

end repeat

put return after foo

end repeat

put empty into missing

repeat with c=2 to number of cd flds

if c is not in fs then put c&space after missing

end repeat

if missing != empty then

put "<section>Miscellaneous</>"&return after foo

put missing into fnos

put fnos after fs

repeat with d= 1 to number of words of fnos

put "<quote header="&quote&line 1 of (cd fld (word d of fnos))&quote&&"link="&quote&"WAIS("&source&")"&quote&">"&return after foo

if line 2 of (cd fld (word d of fnos)) is empty then put line 3 to 30000 of (cd fld (word d of fnos)) after foo

else put line 2 to 30000 of (cd fld (word d of fnos)) after foo

put "</quote>"&return&return after foo

end repeat

put return after foo

end if

put return&"</document>"&return after foo

open file fname

write foo to file fname

close file fname

-- answer foo

end exportToFile

function getDocFromID id, nlines

global connectionId, source

if last word of source is "ecs.soton.ac.uk" then

get "/usr/lib/wais/bin/getdoc > /home/wynkyn/2/users/lac/tmp/FOO.DOC"

else

get "/usr/lib/wais/bin/getdoc"

end if

TCPSend connectionId, it&return

put length(it)+1 into l

get TCPstate(connectionID)

put "Requesting document"

repeat with c=1 to number of lines of id

put line c of id into theLine

repeat

if theline is empty then exit repeat

TCPSend connectionId, word 1 to 20 of theLine & return

get TCPRecvUpTo(connectionID, return, 10, empty)

if it contains "§§§" then

TCPFAIL

end if

set the cursor to busy

put length(word 1 to 20 of theLine)+1 into l

delete word 1 to 20 of theLine

end repeat

end repeat

--TCPSend connectionId, numToChar(4)&return

repeat

set the cursor to busy

get TCPRecvUpTo(connectionID, return, 10, empty)

if it contains "§§§" then

TCPFAIL

end if

if it contains "done." then

exit repeat

end if

end repeat

put empty into theRes

wait 1 second

repeat

put TCPCharsAvailable(connectionId) into n

if n = 0 then exit repeat

get TCPRecvUpTo(connectionID,return,2,empty)

end repeat

put "Retrieving document"

put 0 into l

if last word of source is "ecs.soton.ac.uk" then

put "UNIX_temporary:FOO.DOC" into fname

open file fname

read from file fname until eof

close file fname

put it into theRes

open "UNIX_temporary:FOO.DOC" with "Giorgio:Applications:BBEdit 2.1.1 ƒ:BBEdit"

else

startProgress

put empty into buffer

repeat

put TCPCharsAvailable(connectionId) into n

put TCPRecvChars(connectionID,n) after buffer

put length(buffer) into m

put buffer into str

if char m-1 to m of buffer is return&linefeed then

put empty into buffer

else

put number of lines of buffer into lb

if lb > 1 then

delete line 1 to lb-1 of buffer

end if

delete last line of str

end if

if str is empty and wprompt(buffer) then

exit repeat

else if str is empty and buffer is (linefeed & "$ ") then

put "ERROR: crashed back to local host"

endProgress

beep

--pop card

exit getDocFromID

end if

set the cursor to busy

if str is empty then next repeat

if prompt(str) then exit repeat

put str after theRes

add number of lines of str to l

progress l,nlines, l,nlines

if str contains "++ x25 server closed connection" then

put "JANET call failed"

endProgress

beep

--pop card

exit getDocFromID

else if str contains "+++- bytes/pkts" then

put "nfs.tn connection timed out"

endProgress

beep

--pop card

exit getDocFromID

end if

end repeat

endProgress

--pop card

delete line 1 of theRes

put superStrip(theRes, lineFeed) into theRes

end if

put empty

hide message box

return theRes

end getDocFromID

on startProgress

end startProgress

on endProgress

end endProgress

on progress

put the params

end progress

A2.3 Lace '93

Lace '93 currently consists of a document containment architecture (demonstrated below) and a simple program to parse the specification of document objects and their relationships that it contains.

A2.3.1 Lace '93 Example Document

<lace93>

<docobjects>

<object id=start><H1>My little document</H1>

This document is about dinosaurs.</object>

<object id=dino1>

<WWW>http://www.hcc.hawaii.edu/dinos/dinos.1.html</WWW></object>

<object id=dino>

<ruler dest=dino1>/For the/ /in the world./</ruler></object>

<object id=o1>Here's some inline document</object>

<object id=o2 type="application/postscript">

0 0 moveto 100 100 lineto stroke

(here's some inline postscript, i.e. formatted document) show

</object>

<object id=f1><WWW>http://bright/cs/papers/www94.html</WWW></object>

<object id=f2><ruler dest=f1>/<H1>/ /^<H1>/</></object>

<object id=o3><ruler dest=f1>96 3</></object>

<object id=o4>Glossary definition of the term hypermedia</object>

<object id=o5><WWW>http://bright.ecs.soton.ac.uk/</WWW></object>

<object id=o6><contents>hypermedia</></object>

<object id=p1><ruler dest=f1>/<H1>/ /^<H1>/</></object>

<object id=p2><file>Makefile</></object>

<object id=p3><file>colphoto.jpg</></object>

<object id=o7>Description of a graphic</object>

<object id=pic1><IMG SRC="http://bright/forest.gif"></object>

<object id=pic2><IMG SRC="http://bright/univ.gif"></object>

<object id=pic3><IMG SRC="http://bright/bargate.gif"></object>

</docobjects>

<docrelationships>

<includes objs="main start">

<includes objs="main dino">

<summary objs="o1 o3">

<quote objs="o4 o3">

<summary objs="o4 o5">

<generic objs="o6 o5">

<alternative objs="start o1">

<includes objs="main f2">

<imagechoice objs="o7 p1 p2 p3">

<includes objs="main pic1">

<alternative objs="pic1 pic2">

<alternative objs="pic1 pic3">

</docrelationships>

</lace93>

A2.3.2 Lace '93 Parser Code

Lex Code

%Start DR

%%

"<lace93>" MARKUP(SLACE93);

"</lace93>" MARKUP(ELACE93);

"<docobjects>" MARKUP(SDOCOBJ);

"</docobjects>" MARKUP(EDOCOBJ);

"<docrelationships>" { BEGIN DR; MARKUP(SDOCRELN); }

"</docrelationships>" { BEGIN 0; MARKUP(EDOCRELN); }

"<relationships>" MARKUP(SRELNS);

"</relationships>" MARKUP(ERELNS);

"<relationship>" MARKUP(SRELN);

"</relationship>" MARKUP(ERELN);

"<remark>" MARKUP(SREM);

"</remark>" MARKUP(EREM);

"<object"[ \t]*id=[^>]*">"

{

extern char *strchr();

char *s=strchr(yytext,'=');

strcpy(idval,s+1);

s=strchr(idval,'>');

*s='\0';

MARKUP(SOBJ);

}

"</object>" MARKUP(EOBJ);

<DR>"<"[^>]*">" MARKUP(RELATIONSHIP);

. { if(!indata && isspace(*yytext)) {}

else{ indata=1;

strcpy(textfrag,yytext); return(TEXT);}}

\n { if(!indata && isspace(*yytext)) {}

else{ indata=1;

strcpy(textfrag,yytext); return(TEXT);}}

%%

Yacc Code

%token SLACE93 ELACE93 SDOCOBJ EDOCOBJ SDOCRELN EDOCRELN RELATIONSHIP SRELNS ERELNS SRELN ERELN SREM EREM SOBJ EOBJ USTAGO UOTAGO UTAGC TEXT

%{

#define MAXSTR 1024

#define MARKUP(x) indata=0;return(x)

char idval[MAXSTR], textfrag[MAXSTR], textval[MAXSTR], contentval[MAXSTR], relval[MAXSTR], ob1val[MAXSTR], ob2val[MAXSTR];

int indata;

extern int numobjs, numrels;

extern char *obspec[], *id[];

extern char *relname[], *relob1[], *relob2[];

%}

%%

lace93: SLACE93 objects docrelns ELACE93 ;

objects: SDOCOBJ objectplus EDOCOBJ ;

objectplus: object | objectplus object ;

object: SOBJ data EOBJ

{numobjs++;

obspec[numobjs]=(char *)malloc(strlen(contentval)+1);

id[numobjs]=(char *)malloc(strlen(idval)+1);

strcpy(obspec[numobjs],contentval);

strcpy(id[numobjs],idval);

};

docrelns: SDOCRELN docrelnplus EDOCRELN ;

docrelnplus: docreln | docrelnplus docreln ;

docreln: RELATIONSHIP

{ sscanf(yytext, "<%s objs=\"%[^ ] %[^\"]\">", relval, ob1val, ob2val);

numrels++;

relname[numrels]=(char *)malloc(strlen(relval)+1);

strcpy(relname[numrels],relval);

relob1[numrels]=(char *)malloc(strlen(ob1val)+1);

strcpy(relob1[numrels],ob1val);

relob2[numrels]=(char *)malloc(strlen(ob2val)+2);

strcpy(relob2[numrels],ob2val);

}

;

data : string {strcpy(contentval,textval);} |

data string {strcpy(contentval,textval);}

;

string : TEXT {strcpy(textval,textfrag);} |

string TEXT {strcat(textval,textfrag);}

;

%%

#include "lex.yy.c"

yyerror(s)

char *s;{

fprintf(stderr,"ERROR: %s\n",s);

}

A2.3.3 Lace '93 Application

Main Module

#include <stdio.h>

#define MAXSTR 1024

#define MAXOBJS 100

#define MAXRELS 100

int numobjs=0, numrels=0;

char *obspec[MAXOBJS], *object[MAXOBJS], *id[MAXOBJS];

char *relname[MAXRELS], *relob1[MAXRELS], *relob2[MAXRELS];

extern char *resolve();

extern void doobj();

char *getobject(c)

int c;{

if(object[c]==NULL) object[c]=resolve(obspec[c]);

return(object[c]);

}

char *objbyid(id)

char *id;{

if(strcmp(id,"main")==0)return("");

else return(getobject(onumbyid(id)));

}

int onumbyid(i)

char *i;{

int c;

for(c=1; c<=numobjs; c++){

if(strcmp(id[c],i)==0)return(c);

}

return(0);

}

char *alternative_of(id)

char *id;{

int c;

for(c=1; c<=numrels; c++)

if((strcmp(relob1[c],id)==0) &&

(strcmp(relname[c],"alternative")==0)){

fprintf(stderr,"Would you rather see %s than %s?\n",

relob2[c], id);

}

return(id);

}

main(){

int c;

yyparse();

obspec[0]=object[0]=NULL;

/****

for(c=1; c<=numobjs; c++)

printf("OBSPEC %d (id %s) = '%s'\n", c, id[c], obspec[c]);

****/

for(c=1; c<=numobjs; c++)

object[c]=NULL;

/****

for(c=1; c<=numobjs; c++)

if(object[c]==NULL)object[c]=resolve(obspec[c]);

for(c=1; c<=numobjs; c++)

printf("OBJECT %d (id %s) = '%s'\n", c, id[c], object[c]);

for(c=1; c<=numrels; c++)

printf("RELATIONSHIP %d = '%s', %s->%s\n", c, relname[c],

relob1[c], relob2[c]);

****/

/** To elaborate the document we find everything that relates to

main **/

doobj("main");

/****

for(c=1; c<=numrels; c++)

if(strcmp(relob1[c],"main")==0){

if(strcmp(relname[c],"contains")==0){

printf("%s",objbyid(alternative_of(relob2[c])));

}

else printf("****main--%s-->%s\n",relname[c],relob2[c]);

}

****/

}

Object Resolver Module

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define MAXSTR 1024

extern char *wwwgrab(), *newsgrab(), *objbyid();

/* Read an entire file into a freshly allocated, null-terminated buffer */
char *filegrab(s)
char *s;{
    int fd;
    long len;
    char *retbuf;
    if((fd=open(s,O_RDONLY))<0)return(NULL);
    len=lseek(fd,0L,2);              /* seek to the end to find the length */
    retbuf=(char *)malloc(len+1);
    lseek(fd,0L,0);                  /* back to the start */
    read(fd,retbuf,len);
    close(fd);
    retbuf[len]='\0';
    return(retbuf);
}

/* Extract the substring of s delimited by the patterns a and b;
   a leading '^' on b excludes the closing pattern from the result */
char *subpartstr(s, a, b)
char *s;
char *a, *b;{
    char *retbuf, *p1, *p2;
    int siz;
    int notend=0;
    if(*b=='^'){
        b++;
        notend++;
    }
    p1=strstr(s,a);
    if(p1==NULL)return(NULL);
    p2=strstr(p1+strlen(a),b);
    if(p2==NULL)return(p1);
    if(notend) siz=p2-p1;
    else siz=p2+strlen(b)-p1;
    retbuf=(char *)malloc(siz+1);
    strncpy(retbuf,p1,siz);
    retbuf[siz]='\0';
    return(retbuf);
}

/* Extract a substring by position: a is a 1-based start position;
   a negative b counts back from the end of the string */
char *subpart(s, a, b)
char *s;
int a, b;{
    char *retbuf;
    int siz;
    if(b<0)siz=strlen(s)+b-a+2;
    else siz=b;
    retbuf=(char *)malloc(siz+1);
    strncpy(retbuf,s+a-1,siz);
    retbuf[siz]='\0';
    return(retbuf);
}

/* Turn an object specifier into the object's contents, dispatching on
   the kind of specifier: WWW, news and file addresses are fetched,
   while ruler specifiers select part of another object */
char *resolve(s)
char *s;{
    char name[MAXSTR], str1[MAXSTR], str2[MAXSTR];
    char *t;
    int first, last;
    if(*s!='<')return s;             /* literal text resolves to itself */
    if(sscanf(s,"<WWW>http:%s",name)==1){
        t=strstr(name,"</");
        *t='\0';
        return(wwwgrab(name));
    }
    else if(sscanf(s,"<WWW>news:%s",name)==1){
        t=strstr(name,"</");
        *t='\0';
        return(newsgrab(name));
    }
    else if(sscanf(s,"<news>%s",name)==1){
        t=strstr(name,"</");
        *t='\0';
        return(newsgrab(name));
    }
    else if(sscanf(s,"<file>%s",name)==1){
        t=strstr(name,"</");
        *t='\0';
        return(filegrab(name));
    }
    else if(sscanf(s,"<ruler dest=%[a-zA-Z0-9]>%d %d",
                   name, &first, &last)==3){
        return(subpart(objbyid(name),first,last));
    }
    else if(sscanf(s,"<ruler dest=%[a-zA-Z0-9]>/%[^/]/ /%[^/]/",
                   name, str1, str2)==3){
        return(subpartstr(objbyid(name),str1,str2));
    }
    else return(s);
}

A2.4 Text Retrieval Experiments

A2.4.1 List of Words and Frequencies used in Text Retrieval Experiments

2975 the

1173 to

1140 of

1092 in

1019 a

715 and

468 has

449 said

380 for

372 was

358 on

355 mr

349 is

321 have

289 he

283 by

273 be

255 that

247 at

242 were

236 been

234 are

221 it

217 with

197 from

191 will

186 an

179 police

167 which

163 after

154 government

145 had

142 they

138 as

138 would

133 his

126 not

123 people

121 two

115 but

115 says

114 their

113 party

111 east

103 who

101 its

101 london

101 today

100 new

99 being

99 more

96 last

92 over

86 this

86 west

80 british

80 minister

77 also

77 ambulance

75 secretary

73 president

72 there

70 britain

70 german

70 out

69 one

68 about

66 first

66 than

63 into

62 leader

61 pay

58 all

58 man

58 up

57 expected

57 killed

57 million

57 no

57 year

55 against

55 us

54 germany

54 mrs

53 called

53 country

52 she

51 emergency

51 when

50 action

50 three

50 union

49 prime

49 talks

48 could

47 calls

47 labour

47 meeting

47 report

46 before

46 service

46 should

45 died

45 general

45 south

44 dispute

44 foreign

44 north

44 say

44 soviet

43 court

43 group

43 health

43 now

42 between

42 national

42 spokesman

41 crews

41 four

41 home

41 week

40 other

39 election

39 her

39 if

39 workers

38 arrested

38 five

38 found

38 injured

38 night

38 years

37 next

37 office

37 since

36 announced

36 nuclear

36 only

36 security

36 yesterday

35 during

34 bbc

34 communist

34 john

34 made

34 month

34 some

34 unions

34 work

33 any

33 former

33 near

33 taken

32 because

32 england

32 european

32 hospital

32 reported

32 six

32 water

31 car

31 power

31 thatcher

30 children

30 company

30 council

30 leaders

30 part

30 told

29 army

29 chief

29 members

29 plans

28 back

28 drug

28 germans

28 murder

28 trade

27 down

27 end

27 high

27 may

27 off

27 or

27 sir

27 world

26 campaign

26 city

26 commission

26 force

26 men

26 northern

26 parliament

26 state

26 united

25 area

25 held

25 least

25 number

24 conference

24 drugs

24 free

24 major

24 officials

24 put

24 states

23 chairman

23 charged

23 decision

23 ec

23 industry

23 ireland

23 officers

23 still

23 them

23 thought

23 time

23 under

23 visit

23 way

23 where

22 changes

22 fire

22 forces

22 help

22 later

22 left

22 seven

22 support

22 troops

21 border

21 committee

21 correspondent

21 defence

21 dr

21 elections

21 environment

21 inquiry

21 offer

21 opposition

21 public

21 stations

21 strike

21 take

21 well

20 agreement

20 another

20 central

20 commons

20 congress

20 days

20 families

20 hungary

20 leading

20 make

20 military

20 money

20 ms

20 place

20 protest

20 reports

20 shot

20 staff

20 use

20 want

19 agreed

19 authorities

19 authority

19 begun

19 bomb

19 bush

19 capital

19 cut

19 feed

19 following

19 including

19 plant

19 scotland

19 speaking

19 station

18 armed

18 case

18 community

18 death

18 due

18 energy

18 europe

18 further

18 given

18 go

18 head

18 increase

18 kong

18 led

18 most

18 official

18 programme

18 refused

18 several

18 so

18 social

18 taking

18 those

18 tomorrow

18 town

18 transport

18 violence

18 warned

18 working

17 already

17 attack

17 ban

17 become

17 billion

17 china

17 claims

17 county

17 earlier

17 gorbachev

17 investigation

17 meanwhile

17 meet

17 through

17 without

16 africa

16 anti

16 appeal

16 arrived

16 beirut

16 cabinet

16 chancellor

16 come

16 david

16 economic

16 give

16 him

16 hong

16 house

16 involved

16 israeli

16 lebanon

16 management

16 many

16 months

16 morning

16 mps

16 natwest

16 outside

16 poland

16 prague

16 radio

16 senior

16 sent

16 set

16 teachers

16 used

16 vote

16 wanted

16 wants

16 war

15 *

15 accident

15 appear

15 around

15 christian

15 collision

15 companies

15 contaminated

15 control

15 department

15 embassy

15 essex

15 increased

15 information

15 issue

15 jobs

15 member

15 news

15 oil

15 policy

15 post

15 second

15 services

15 shadow

15 stop

15 thousands

15 wales

14 according

14 accused

14 allegations

14 austria

14 boat

14 body

14 debate

14 education

14 full

14 holding

14 interest

14 ira

14 japan

14 kinnock

14 leadership

14 ministry

14 move

14 non

14 others

14 rate

14 sea

14 statement

14 summit

14 sweden

14 television

14 while

14 women

13 ago

13 american

13 appeared

13 believed

13 clarke

13 continue

13 countries

13 day

13 dead

13 did

13 eight

13 fans

13 farms

13 indian

13 issued

13 lead

13 local

13 long

13 lost

13 mass

13 miles

13 nearly

13 neil

13 officer

13 payments

13 peter

13 politburo

13 political

13 press

13 province

13 reforms

13 remain

13 resigned

13 responsible

13 royal

13 seized

13 sunday

13 threat

13 took

13 until

13 widespread

12 added

12 aids

12 attempt

12 better

12 black

12 board

12 both

12 build

12 can

12 claimed

12 co

12 coalition

12 collided

12 connection

12 criminal

12 deng

12 despite

12 explosion

12 fell

12 ferry

12 fired

12 gas

12 goods

12 groups

12 industrial

12 international

12 killing

12 krenz

12 manchester

12 march

12 mp

12 nhs

12 offered

12 opened

12 plan

12 previous

12 questioned

12 rejected

12 released

12 result

12 rise

12 saturday

12 seats

12 seen

12 ship

12 suspended

12 uk

12 urged

12 very

12 voted

12 wednesday

12 won

11 although

11 ambulances

11 asked

11 bank

11 believe

11 bid

11 call

11 challenge

11 charge

11 church

11 confirmed

11 cost

11 cover

11 czechoslovakia

11 de

11 deal

11 democrats

11 denied

11 director

11 discuss

11 every

11 follows

11 france

11 friday

11 future

11 great

11 greater

11 hurd

11 ill

11 imposed

11 india

11 kenneth

11 launched

11 legal

11 likely

11 line

11 moscow

11 must

11 parkinson

11 possible

11 proposed

11 protests

11 published

11 questioning

11 rail

11 replaced

11 ruling

11 scottish

11 serious

11 seriously

11 tv

11 walked

11 welcomed

11 woman

11 yorkshire

10 afternoon

10 agency

10 aid

10 alleged

10 among

10 announcement

10 began

10 belfast

10 berlin

10 cape

10 carried

10 cause

10 conditions

10 democracy

10 document

10 early

10 engine

10 ex

10 far

10 figures

10 find

10 ford

10 front

10 hold

10 hours

10 however

10 include

10 jaguar

10 king

10 klerk

10 known

10 law

10 laws

10 m

10 measures

10 might

10 missing

10 needed

10 normally

10 november

10 parties

10 phillips

10 present

10 pressure

10 red

10 reduction

10 refugees

10 release

10 republic

10 rights

10 ruled

10 schools

10 seeking

10 share

10 show

10 site

10 soon

10 standards

10 total

10 walker

10 written

10 yard

10 york

9 african

9 again

9 animal

9 association

9 awarded

9 brought

9 calling

9 casualties

9 centre

9 change

9 charges

9 coast

9 condemned

9 conspiracy

9 continuing

9 cook

9 crash

9 crime

9 critical

9 delegates

9 demand

9 described

9 douglas

9 each

9 eastern

9 economy

9 exodus

9 failed

9 fighting

9 football

9 forum

9 fund

9 glasgow

9 holland

9 human

9 independent

9 inflation

9 investment

9 irish

9 magistrates

9 making

9 much

9 newspaper

9 offences

9 passengers

9 peace

9 person

9 privatisati

9 rally

9 reached

9 rebels

9 refusing

9 relations

9 robert

9 sank

9 scheme

9 september

9 such

9 then

9 train

9 ulster

9 whether

9 worth

9 wounded

8 act

8 air

8 allow

8 amount

8 announce

8 annual

8 archbishop

8 attacked

8 available

8 away

8 business

8 came

8 care

8 cases

8 catholic

8 child

8 civil

8 clear

8 college

8 colombia

8 costs

8 couple

8 crashed

8 cross

8 custody

8 customs

8 czechoslova

8 december

8 decided

8 democratic

8 deputy

8 do

8 earth

8 el

8 electrical

8 electricity

8 family

8 ferranti

8 financial

8 gerasimov

8 guildford

8 hindu

8 included

8 investigate

8 israel

8 jiang

8 joint

8 justice

8 keep

8 kept

8 operation

8 pact

8 passenger

8 policies

8 poll

8 posts

8 princess

8 privatised

8 pro

8 research

8 resignation

8 review

8 road

8 role

8 saying

8 situation

8 solidarity

8 sources

8 spent

8 st

8 start

8 supplies

8 surrounded

8 survey

8 syrian

8 takeover

8 temple

8 term

8 terrorist

8 terrorists

8 threatened

8 transplant

8 unsafe

8 victims

8 wakeham

8 western

8 what

8 wife

8 yet

8 zone

7 able

7 across

7 affected

7 aged

7 airport

7 alan

7 amnesty

7 anthony

7 barnett

7 biggest

7 birmingham

7 bring

7 brooke

7 buy

7 camp

7 camps

7 cannabis

7 coach

7 cocaine

7 confirm

7 consider

7 constable

7 convicted

7 criticised

7 current

7 details

7 diplomat

7 dismissed

7 drew

7 dropped

7 egon

7 ensure

7 exchange

7 expecting

7 exploded

7 fall

7 forward

7 freedom

7 french

7 gould

7 hit

7 hungarian

7 hurricane

7 imported

7 introduced

7 jordan

7 judge

7 just

7 kohl

7 latest

7 leaving

7 letter

7 liberal

7 lifted

7 live

7 market

7 midnight

7 minutes

7 monday

7 monopoly

7 newcombe

7 open

7 orders

7 overtime

7 parents

7 paul

7 plane

7 planes

7 private

7 problem

7 production

7 pupils

7 question

7 rather

7 ridley

7 salvador

7 same

7 settlement

7 shares

7 shop

7 southern

7 special

7 stay

7 system

7 thames

7 too

7 trial

7 try

7 tuc

7 ubs

7 unless

7 unrest

7 wall

7 warsaw

7 weekend

7 whose

6 advanced

6 advertising

6 age

6 agriculture

6 ahead

6 airways

6 allowed

6 almost

6 aoun

6 apparently

6 areas

6 arms

6 attacks

6 attempted

6 august

6 banks

6 based

6 begin

6 best

6 blamed

6 captain

6 carnogursky

6 cash

6 civic

6 clashes

6 close

6 colombian

6 concern

6 concerned

6 consumer

6 continued

6 controls

6 councils

6 declared

6 defraud

6 disaster

6 discussed

6 duty

6 elected

6 employees

6 engineering

6 evidence

6 factory

6 february

6 figure

6 firemen

6 food

6 forced

6 form

6 fraud

6 fuels

6 funds

6 game

6 george

6 good

6 gordon

6 guerrilla

6 guerrillas

6 having

6 hearing

6 higher

6 himself

6 hoped

6 hospitals

6 hostages

6 hundred

6 important

6 improved

6 infected

6 inside

6 interview

6 invasion

6 island

6 jailed

6 james

6 join

6 journalist

6 language

6 lanka

6 large

6 largest

6 leaked

6 lebanese

6 less

6 life

6 like

6 link

6 loan

6 loans

6 lord

6 magazine

6 malta

6 negotiating

6 nicholas

6 normal

6 occurred

6 ordered

6 organisatio

6 own

6 package

6 paid

6 paris

6 passing

6 philippines

6 pledged

6 polish

6 polling

6 poole

6 position

6 prevent

6 prices

6 prison

6 promised

6 proposals

6 protection

6 provide

6 radios

6 raised

6 rates

6 received

6 resign

6 resulted

6 results

6 right

6 robin

6 roger

6 rule

6 saudi

6 seek

6 select

6 sell

6 sentence

6 separate

6 short

6 shots

6 smith

6 speak

6 squad

6 sri

6 started

6 steps

6 stockbroker

6 student

6 supply

6 suspected

6 tax

6 teacher

6 team

6 tests

6 trying

6 unity

6 using

6 vehicles

6 warns

6 white

6 winds

6 withdrawn

6 yarmouth

5 '

5 abolition

5 acting

5 agenda

5 aircraft

5 alternative

5 america

5 anniversary

5 answer

5 applied

5 approved

5 april

5 arbitration

5 arrive

5 arts

5 assets

5 assistant

5 author

5 b

5 banned

5 birth

5 blast

5 bonn

5 book

5 br

5 breaking

5 brigade

5 bringing

5 britons

5 broke

5 building

5 built

5 buying

5 canterbury

5 capsized

5 caused

5 cheshire

5 clearing

5 closed

5 coal

5 collins

5 conflict

5 conservativ

5 considered

5 consortium

5 constitutio

5 convictions

5 cornwall

5 deaths

5 decide

5 delay

5 derbyshire

5 development

5 disease

5 double

5 drowned

5 dutch

5 efforts

5 employers

5 ended

5 english

5 equal

5 establish

5 estimated

5 even

5 evening

5 ever

5 exclusion

5 experts

5 facing

5 failure

5 fellow

5 fight

5 followed

5 friends

5 fully

5 funding

5 fw

5 get

5 got

5 green

5 half

5 heart

5 heavy

5 helicopters

5 highly

5 homes

5 how

5 huge

5 improve

5 incentives

5 incident

5 includes

5 indicated

5 injuries

5 insufficien

5 involving

5 iraq

5 islamic

5 japanese

5 jury

5 key

5 kidnapped

5 killings

5 late

5 lose

5 mainly

5 mainten

5 malcolm

5 marcos

5 mark

5 markets

5 match

5 mayor

5 mccarth

5 mean

5 migrant

5 mikhail

5 modrow

5 moment

5 motion

5 nato

5 nearby

5 negotia

5 norfolk

5 notting

5 ozone

5 pakista

5 panoram

5 park

5 passed

5 patten

5 plants

5 players

5 pope

5 prevent

5 previou

5 price

5 profits

5 provide

5 raids

5 rajiv

5 re

5 reactor

5 ready

5 recorde

5 region

5 remande

5 remove

5 residen

5 restric

5 rig

5 riot

5 river

5 robbery

5 ruc

5 rushdie

5 safety

5 school

5 search

5 secret

5 see

4 abroad

4 abuse

4 accept

4 access

4 adamec

4 admitted

4 advice

4 affair

4 aimed

4 allan

4 allegedly

4 allowing

4 answering

4 antrim

4 antwerp

4 anyone

4 appealed

4 arab

4 arabia

4 argentina

4 arrow/count

4 article

4 assurances

4 attempts

4 attended

4 ayodha

4 ba

4 badly

4 ballot

4 bar

4 barricaded

4 base

4 basildon

4 battle

4 bazoft

4 became

4 becoming

4 beef

4 behind

4 believes

4 betting

4 bill

4 blackpool

4 blue

4 bnfl

4 boost

4 breach

4 briton

4 broadcastin

4 bryan

4 bse

4 budapest

4 burma

4 burning

4 cable

4 campaigner

4 cancelled

4 chance

4 charities

4 cheap

4 chris

4 clash

4 climbdown

4 club

4 comes

4 comment

4 commissione

4 commitment

4 commonwealt

4 compensatio

4 compromise

4 condition

4 container

4 cope

4 courts

4 credit

4 cricklewood

4 crowd

4 customers

4 czech

4 dagenham

4 damage

4 data

4 dealing

4 defensive

4 demands

4 derby

4 desmond

4 destruction

4 devices

4 dinkins

4 disposal

4 division

4 docked

4 doctors

4 dollar

4 domestic

4 done

4 dorset

4 drinking

4 driver

4 drop

4 earmarked

4 edinburgh

4 effect

4 effective

4 enough

4 entered

4 environment

4 executions

4 executive

4 executives

4 existing

4 experimental

4 filled

4 final

4 flights

4 flown

4 forged

4 foundation

4 fourth

4 frigate

4 funeral

4 gale

4 gandhi

4 gennady

4 governments

4 grant

4 grants

4 guard

4 gummer

4 gunmen

4 guns

4 haemophilia

4 halted

4 hampshire

4 handling

4 hands

4 hans

4 haughey

4 helmut

4 herrhausen

4 historic

4 holiday

4 honecker

4 houses

4 housing

4 hugo

4 inadequate

4 incidents

4 increasing

4 independenc

4 institute

4 interocean

4 investigate

4 iran

4 islands

4 italy

4 jack

4 jail

4 january

4 judges

4 july

4 kent

4 knew

4 laundering

4 lawson

4 leave

4 legislation

4 let

4 levels

4 main

4 maisonette

4 mann

4 manual

4 marchioness

4 means

4 medellin

4 medical

4 memorial

4 menem

4 milk

4 millions

4 miners

4 monitored

4 monopolies

4 moslem

4 mosque

4 motorbike

4 movement

4 moves

4 murdering

4 narrowly

4 nationalist

4 nations

4 nationwide

4 nature

4 negotiators

4 nigel

4 nine

4 old

4 opening

4 opinion

4 opportunity

4 order

4 outlined

4 overall

4 overnight

4 p

4 paper

4 path

4 patients

4 patricia

4 patrolling

4 paying

4 pending

4 penguin

4 per

4 peru

4 pile

4 planning

4 platform

4 pleasure

4 points

4 policemen

4 polls

4 possibility

4 pravda

4 rebel

4 receive

4 recently

4 record

4 reduce

4 redundancie

4 reform

4 refusal

4 regiment

4 relatives

4 reporting

4 represent

4 represented

4 resume

4 resumed

4 retire

4 returning

4 reversed

4 rifkind

4 rises

4 risk

4 rival

4 rivers

4 romanjan

4 rumours

4 run

4 russian

4 sacked

4 sale

4 sales

4 salman

4 science

4 seat

4 sheffield

4 shipping

4 shooting

4 shortage

4 shown

4 singh

4 smuggled

4 sold

4 spoke

4 spotted

4 stable

4 step

4 stopped

4 strong

4 succeed

4 summonses

4 survivors

4 suspect

4 suspects

4 suspicious

4 synod

4 tadeusz

4 telephone

4 telephones

4 tour

4 towed

4 tractor

4 traffic

4 treasury

4

4 tried

4 trouble

4 turned

4 tutu

4 unofficial

4 urging

4 vauxhall

4 verdict

4 via

4 victory

4 village

4 visited

4 volcano

4 voluntary

4 voting

4 waddington

4 waiting

4 ways

4 why

4 willesden

4 works

4 worse

4 worst

4 writs

4 xiaoping

4 young

4 zemin

3 abolish

3 abortion

3 accompanied

3 accusing

3 additional

3 address

3 adviser

3 aerospace

3 affairs

3 affect

3 africans

3 aim

3 disling

3 alexander

3 alleviate

3 alliance

3 allies

3 along

3 alongside

3 ambulanceme

3 amid

3 ammunition

3 amounting

3 arm

3 arrest

3 arrival

3 artist

3 asia

3 asian

3 assembly

3 attending

3 austrian

3 babies

3 baghdad

3 bail

3 barriers

3 bases

3 beating

3 beginning

3 below

3 benefits

3 bills

3 birthday

3 bit

3 blocking

3 blood

3 boesak

3 bombing

3 bombings

3 born

3 bratislava

3 break

3 brian

3 bridge

3 britoil

3 broadcaster

3 bus

3 cathedral

3 cbi

3 cent

3 changed

3 channel

3 charles

3 chau

3 chemical

3 chemicals

3 choice

3 cholera

3 chosen

3 christmas

3 chunnel

3 circumstanc

3 clapham

3 class

3 climate

3 coca

3 collapsed

3 colony

3 compared

3 complex

3 controversi

3 corruption

3 counter

3 counterfeit

3 counterpart

3 coup

3 crisis

3 critically

3 criticising

3 crossing

3 crown

3 cup

3 cuts

3 cyprus

3 daily

3 damaged

3 danger

3 dealings

3 debating

3 decisions

3 deficit

3 deputies

3 destroyed

3 detectives

3 development

3 device

3 disappeared

3 discovered

3 discussions

3 disrupt

3 dissident

3 documents

3 dollars

3 dominating

3 drawn

3 dubcek

3 dublin

3 duchess

3 e

3 elect

3 eleven

3 elsewhere

3 emigrate

3 emissions

3 encourage

3 equipment

3 escalation

3 escaped

3 estate

3 estimates

3 ethnic

3 excluded

3 explain

3 extra

3 extraordina

3 eye

3 face

3 fear

3 fears

3 federal

3 ferdinand

3 fewer

3 fireman

3 firms

3 forests

3 formal

3 fought

3 garcia

3 gathered

3 gave

3 geoffrey

3 georgia

3 germanys

3 gerry

3 gloucesters

3 going

3 greece

3 greenhouse

3 growing

3 growth

3 guarantee

3 gun

3 gunman

3 hand

3 handled

3 headed

3 heading

3 headquarter

3 heard

3 helicopter

3 helping

3 heroin

3 heseltine

3 hill

3 hole

3 holidays

3 hope

3 hotel

3 hour

3 hundreds

3 hunt

3 hurt

3 husak

3 husband

3 i

3 ii

3 illegally

3 immediate

3 imports

3 industries

3 injuring

3 inquiries

3 instead

3 intercity

3 iraqi

3 isle

3 jaruzel

3 jet

3 jewelle

3 job

3 judicia

3 june

3 jungle

3 kind

3 lack

3 ladisla

3 latter

3 layer

3 leaking

3 liberat

3 lift

3 limit

3 linked

3 list

3 lockerb

3 losses

3 lot

3 low

3 lower

3 loyalis

3 luxury

3 mail

3 maintai

3 maintai

3 manager

3 managin

3 manila

3 marched

3 margate

3 marine

3 materid

3 mcginni

3 measure

3 mechani

3 message

3 metropo

3 meyer

3 missile

3 mixed

3 moldavi

3 moldavi

3 mortgag

3 motors

3 moved

3 mudwad

3 multi

3 murdere

3 navy

3 negotia

3 neither

3 newcast

3 observer

3 obtain

3 occupied

3 offensive

3 offices

3 officially

3 often

3 ortega

3 osman

3 our

3 outbreak

3 outcome

3 outskirts

3 oxfordshire

3 paintings

3 palace

3 parliamenta

3 particularl

3 partner

3 partners

3 patient

3 personnel

3 persuade

3 petrol

3 placing

3 planted

3 played

3 plunged

3 plymouth

3 policeman

3 pollution

3 poor

3 popular

3 postal

3 practices

3 pregnant

3 premier

3 presidentia

3 pressurised

3 prestwick

3 primary

3 progress

3 project

3 prosecuted

3 prosecution

3 prospect

3 prospects

3 providing

3 puppet

3 quality

3 radiotherap

3 range

3 rays

3 reach

3 recognise

3 reconsidere

3 reformers

3 repeal

3 repeated

3 repeatedly

3 rescue

3 reshuffle

3 resignation

3 response

3 restore

3 restriction

3 restructure

3 returned

3 reunificati

3 reuniting

3 richard

3 rigging

3 rising

3 roads

3 rocket

3 romania

3 rome

3 room

3 routes

3 rover

3 runcie

3 safe

3 san

3 satanic

3 save

3 saw

3 scale

3 scientists

3 section

3 segregated

3 seizure

3 self

3 sellafield

3 send

3 sending

3 sensitive

3 sentenced

3 sergeant

3 served

3 sexual

3 shops

3 signs

3 similar

3 sizewell

3 slump

3 smaller

3 smyth

3 soccer

3 socialist

3 solicitor

3 son

3 southwark

3 speculation

3 split

3 standard

3 stands

3 star

3 starting

3 steel

3 stepped

3 stepping

3 stock

3 stockholm

3 strongly

3 structure

3 sun

3 sutcliffe

3 talk

3 tanker

3 task

3 tell

3 tens

3 theatre

3 thieves

3 together

3 tom

3 tomes

3 tough

3 trace

3 traffickers

3 tragedy

3 transaction

3 trust

3 tuesday

3 twice

3 tyrone

3 un

3 unable

3 unidentifie

3 unit

3 unrealistic

3 usa

3 vacancies

3 verses

3 veto

3 vice

3 victim

3 view

3 violent

3 votes

3 walesa

3 warren

3 whitley

3 whole

3 whom

3 wigan

3 wight

3 wiltshire

3 winter

3 wirral

3 withdraw

2 abu

2 acas

2 accepted

2 accidents

2 account

2 accounted

2 accounts

2 activists

2 activities

2 add

2 addition

2 addresses

2 adequate

2 adjourned

2 advertiseme

2 adverts

2 advisors

2 airlines

2 albanian

2 alfred

2 algeria

2 ali

2 alive

2 amassing

2 amateur

2 ambassador

2 anderton

2 anger

2 anglia

2 angry

2 announcing

2 anonymous

2 antarctic

2 anticipated

2 antonio

2 apartment

2 appalling

2 apply

2 appointment

2 approach

2 arens

2 argued

2 arrears

2 arrests

2 arriving

2 ask

2 assassinati

2 assault

2 atmosphere

2 attempting

2 attracted

2 auditor

2 australia

2 automatic

2 availabilit

2 average

2 avert

2 bag

2 bahamas

2 baker

2 bakery

2 bakker

2 balance

2 ballistic

2 balloted

2 banham

2 banker

2 banking

2 banners

2 banning

2 barons

2 barrage

2 battles

2 bavaria

2 beaten

2 beckerr

2 becomes

2 behalf

2 belief

2 belts

2 bench

2 benefit

2 benidorm

2 beverley

2 big

2 bihar

2 bike

2 birds

2 bitter

2 blackout

2 blacks

2 blew

2 block

2 boateng

2 boats

2 boeing

2 bogota

2 bolivia

2 bookshop

2 bp

2 brain

2 brazilian

2 bribe

2 briefing

2 brighton

2 broad

2 broadcast

2 brother

2 brown

2 brussels

2 brutality

2 bsc

2 buckingham

2 buildings

2 busy

2 campaigning

2 canada

2 candidate

2 carbon

2 carlisle

2 carlos

2 carrying

2 cartel

2 catastrophi

2 caught

2 caulton

2 caution

2 celebrated

2 centres

2 ceremonies

2 cfcs

2 cha

2 chances

2 checkpoint

2 chelsea

2 chinese

2 choose

2 cities

2 citizens

2 civilians

2 claimants

2 claiming

2 classroom

2 cleric

2 clubs

2 clwyd

2 coetzee

2 cold

2 colonel

2 comaneci

2 combat

2 commanded

2 commemorate

2 commercial

2 committed

2 common

2 comparable

2 complete

2 completed

2 comply

2 confederati

2 confessions

2 confidence

2 confident

2 confidentia

2 confrontati

2 consignment

2 conspiring

2 constituenc

2 contact

2 contained

2 crack

2 crackdown

2 create

2 created

2 crew

2 criticism

2 crowded

2 crowds

2 cultural

2 cunningham

2 curfew

2 cutbacks

2 dawes

2 deadline

2 dealers

2 debts

2 decline

2 decrease

2 deducted

2 deep

2 defeated

2 defences

2 defended

2 defending

2 delayed

2 delegation

2 delors

2 demanding

2 demonstrati

2 demonstrati

2 demonstrato

2 denies

2 depletion

2 deposed

2 depriving

2 derek

2 designated

2 determined

2 deterred

2 developed

2 developing

2 devonport

2 dialogue

2 dickel

2 differences

2 different

2 difficult

2 difficulty

2 dioxide

2 diplomats

2 disagreemen

2 disbanded

2 discotheque

2 discovery

2 discussing

2 dishes

2 dismantling

2 dixon

2 diy

2 doctor

2 does

2 donald

2 downing

2 dredger

2 dresden

2 dressed

2 drought

2 dunoon

2 durban

2 eduardo

2 effects

2 efficiency

2 elderly

2 electronics

2 ellesmere

2 else

2 emigrants

2 encouraging

2 enforce

2 engaged

2 enter

2 entering

2 epidemic

2 erich

2 escapees

2 escort

2 esso

2 evacuated

2 evacuation

2 events

2 everyone

2 exercise

2 expects

2 experience

2 exploration

2 explosions

2 exposed

2 express

2 expressed

2 extradition

2 extreme

2 fa

2 faces

2 faction

2 fair

2 faldo

2 faulty

2 feet

2 fighters

2 finalised

2 finnish

2 firefighter

2 firm

2 firmly

2 fog

2 follow

2 forcibly

2 foreigners

2 forest

2 formed

2 forming

2 fortune

2 forty

2 fossil

2 fourteen

2 frank

2 frankfurt

2 fulham

2 gain

2 gang

2 garrison

2 generally

2 getting

2 ginniff

2 giuliani

2 giving

2 global

2 glyndwr

2 goal

2 goes

2 gone

2 goodwin

2 gotland

2 guarantees

2 guards

2 guest

2 guilty

2 gulf

2 gyula

2 hammersmith

2 hammond

2 handed

2 handicapped

2 happen

2 happened

2 hard

2 hardliners

2 harmful

2 haul

2 hayward

2 heads

2 heathrow

2 here

2 hesitate

2 hidden

2 highlight

2 hillsboroug

2 hindley

2 hired

2 history

2 homeless

2 humber

2 hussein

2 ian

2 identif

2 illegal

2 impact

2 impleme

2 improve

2 inactiv

2 infecti

2 informe

2 ingleto

2 injunct

2 innocen

2 insider

2 install

2 intelli

2 intende

2 intensi

2 inter

2 introdu

2 investm

2 invited

2 iramedia

2 isc

2 issues

2 issuing

2 italian

2 italian

2 itself

2 jan

2 jihad

2 jim

2 joined

2 joining

2 jones

2 jordani

2 jose

2 journey

2 junctio

2 junior

2 justifi

2 kaifu

2 karolyi

2 katyush

2 kelly

2 kidnapp

2 killer

2 kilos

2 kingdom

2 klawer

2 kurds

2 lancash

2 landing

2 latin

2 launch

2 lawfull

2 lecturers

2 leeds

2 leicestersh

2 leon

2 level

2 libel

2 liberties

2 light

2 locked

2 londonderry

2 looking

2 lowest

2 lubbers

2 lubowski

2 lump

2 macfarlaine

2 machine

2 mafia

2 maginn

2 maguires

2 maigret

2 mainland

2 maize

2 mallon

2 mansfield

2 manslaughte

2 manuel

2 margate

2 maria

2 marines

2 maronite

2 martin

2 marxist

2 matter

2 maude

2 maximum

2 mayoral

2 mcginn

2 mconie

2 meat

2 media

2 meets

2 meibion

2 mellor

2 memo

2 merseyside

2 messages

2 metal

2 michael

2 michelle

2 mistrust

2 mladenov

2 modern

2 monitor

2 moors

2 mother

2 motivated

2 myra

2 n

2 named

2 nationals

2 natural

2 necessary

2 needs

2 neighbourin

2 network

2 newly

2 nicaragua

2 nidal

2 nor

2 notes

2 nujoma

2 nupe

2 object

2 objections

2 obtained

2 once

2 operations

2 opportuniti

2 opposed

2 optimistic

2 opting

2 option

2 ordination

2 outline

2 outlining

2 overboard

2 overwhelmin

2 pacific

2 packages

2 palermo

2 pan

2 papers

2 paramilitar

2 parked

2 participati

2 parts

2 past

2 peacekeepin

2 pensioners

2 pentagon

2 period

2 permission

2 permit

2 peruvian

2 phetchaburi

2 phone

2 photographs

2 picasso

2 pitra

2 planned

2 politics

2 polytechnic

2 poorer

2 preliminary

2 preparation

2 prepare

2 presented

2 presenting

2 presidency

2 priests

2 prince

2 printing

2 privileges

2 probably

2 problems

2 process

2 produce

2 produced

2 producers

2 professiona

2 promote

2 prosecution

2 provided

2 provisional

2 provisions

2 pub

2 purpose

2 quasar

2 quickly

2 radiation

2 radical

2 raf

2 raided

2 railways

2 rainfall

2 raising

2 rape

2 reaffirmed

2 real

2 rear

2 reception

2 recognition

2 recommend

2 recommended

2 recovery

2 recruiting

2 referendum

2 referred

2 refuge

2 refugee

2 regional

2 reinforced

2 reiterated

2 reject

2 rejection

2 religious

2 remained

2 remaining

2 removed

2 repatriatio

2 required

2 resin

2 resistance

2 resolved

2 respect

2 respond

2 retail

2 retired

2 revolution

2 rifles

2 ring

2 rioting

2 riyadh

2 rockets

2 roman

2 rose

2 row

2 rowntree

2 rudolf

2 rudolph

2 rugby

2 rumbold

2 rumour

2 running

2 runs

2 s

2 sack

2 sailing

2 sailor

2 sam

2 sanctions

2 satisfied

2 scaled

2 scientific

2 scrapped

2 screening

2 sdlp

2 searching

2 sector

2 sedition

2 seems

2 sees

2 seizures

2 sek

2 semi

2 series

2 setting

2 settled

2 severe

2 severely

2 sex

2 sfeir

2 shareholder

2 sharp

2 sharply

2 shelter

2 shevardnadz

2 sign

2 signal

2 significant

2 simenon

2 simon

2 skipper

2 slightly

2 slovaks

2 slowing

2 smashed

2 smuggling

2 sncf

2 solicitors

2 somerset

2 somogyi

2 source

2 sparked

2 speaker

2 speakers

2 specialist

2 speed

2 spied

2 spread

2 stabbing

2 stadium

2 staid

2 standby

2 standing

2 stated

2 statements

2 status

2 staying

2 stealing

2 stephen

2 sterling

2 stevens

2 stones

2 stood

2 storm

2 stranded

2 strategic

2 streets

2 strength

2 strengtheni

2 stressed

2 strict

2 strikes

2 striking

2 stringent

2 stronger

2 studies

2 studios

2 studying

2 subject

2 subsidiary

2 sue

2 suggest

2 suprise

2 surgery

2 surrey

2 sutherland

2 swanley

2 swedish

2 swire

2 swiss

2 switzerland

2 tackle

2 tai

2 takes

2 tankers

2 taxis

2 teaching

2 tebbit

2 technology

2 televising

2 terminal

2 terms

2 territories

2 thai

2 thailand

2 theft

2 these

2 thick

2 things

2 thomas

2 threaten

2 threatening

2 thrown

2 thursday

2 ties

2 tilting

2 tolba

2 tongue

2 tool

2 torture

2 totally

2 tourism

2 tourist

2 tourists

2 tracking

2 trades

2 trains

2 trans

2 transferred

2 travelling

2 trawler

2 treasured

2 treating

2 trend

2 tropical

2 truce

2 truck

2 tune

2 turkey

2 unionist

2 unprecedent

2 unusual

2 unveiled

2 upon

2 uvf

2 vacant

2 van

2 various

2 vasconcello

2 vehicle

2 vessel

2 veterans

2 video

2 viewers

2 views

2 villa

2 violated

2 virginia

2 virtually

2 visa

2 vishwanath

2 visitors

2 voice

2 void

2 volunteers

2 vorkuta

2 vredendal

2 wages

2 wake

2 walkout

2 ward

2 warehouse

2 warming

2 waste

2 watchdog

2 watched

2 we

2 wear

2 weather

2 whatsoever

2 widow

2 wildlife

2 wilks

2 william

2 wilson

2 windhoek

2 witnesses

2 woolwich

2 worker

2 workman

2 workmen

2 worldwide

2 worsen

2 wounding

2 writ

2 xinjiang

1 aboard

1 abode

1 aborted

1 absence

1 absent

1 absolutely

1 abused

1 abuses

1 academic

1 accelerated

1 accents

1 acceptable

1 accepting

1 accessible

1 accommodate

1 accommodati

1 accomplice

1 accordingly

1 accountants

1 accusation

1 achievable

1 achieving

1 acknowledge

1 acknowledge

1 acquire

1 acquitted

1 acted

1 activist

1 activistis

1 activity

1 acts

1 adami

1 addenbrooke

1 addressing

1 adds

1 adele

1 adjusted

1 adjustment

1 administere

1 admitting

1 adopt

1 adopted

1 advance

1 advise

1 advisers

1 advisory

1 advocates

1 afford

1 afterwards

1 aggravated

1 aggressive

1 agreeing

1 agreements

1 aims

1 airbase

1 aires

1 airline

1 alarming

1 alcohol

1 aldergrove

1 alebrto

1 alerted

1 alight

1 alison

1 allegation

1 alleging

1 allen

1 allied

1 allocate

1 allotments

1 allows

1 alone

1 alun

1 always

1 am

1 amassed

1 amazon

1 ambassador

1 amendments

1 americas

1 amethi

1 amongst

1 amounted

1 amounts

1 anaylysts

1 anc

1 andean

1 anders

1 andreas

1 andreotti

1 andrew

1 angela

1 anglicans

1 anglo

1 angola

1 ann

1 annoucement

1 announcment

1 anonymously

1 answered

1 anton

1 anxiety

1 anything

1 anywhere

1 ap

1 apap

1 apart

1 apology

1 apparent

1 appeals

1 appearance

1 appears

1 appliances

1 application

1 arctic

1 ardboe

1 ardoyne

1 argue

1 argumen

1 armoure

1 arne

1 arrival

1 arsenal

1 arson

1 arthur

1 article

1 artille

1 ash

1 ashdown

1 ashore

1 asking

1 aspect

1 assasin

1 assassi

1 assault

1 asse~0

1 assembl

1 assess

1 assessin

1 assignm

1 assisti

1 assocat

1 assumes

1 astrono

1 asylum

1 atlanta

1 attache

1 attacki

1 attend

1 attitud

1 attorne

1 au

1 audi

1 audit

1 auguste

1 authori

1 authori

1 autonora

1 autumn

1 auxilia

1 avenge

1 avianca

1 avon

1 aware

1 awarene

1 axe

1 ayatoll

1 ayrshir

1 backed

1 backgro

1 bad

A2.4.2 Graph of Frequency vs Rank in Text Retrieval Experiments

[Picture: a close-up of the hyperbola obtained when plotting the frequency (y axis) of a word against its rank (x axis, /1000).]