Technical Report No. ECSTR-IAM00-1 : Last Revision : 29/02/2000
Multimedia Research Group, University of Southampton, UK
The Web is a linked literature: we wish to discover what the authors of Web pages are choosing to link and what they are choosing to link to. It is hoped that understanding interconnectedness as it is practised in the Web through links will enable us to see what kinds of hypertext are achievable using common technologies and what is impracticable. Understanding the arrangement of the links helps us to understand the construction of the page as a whole, which in turn helps us to understand the purpose of the links. This paper discusses a search for examples of good subject-based hypertext linking, the linking statistics that we drew from those pages and the linking practices that the statistics represent. We also show how the analysis of how hypertext links are written can be applied to the problem of Web page reading in non-standard and reduced-bandwidth Web browsing applications.
hypertext analysis, hypertext linking style, hypertext metrics, PDA browsers
As a group that has spent many years applying open hypermedia technologies to the Web, we have long been interested to find out how in practice people are building Web sites today, what they link and what they link to. The hypothesis was that the results of such analysis might inform us about when and where open hypermedia technology might be most usefully applied and help us to design better systems to support link authoring and maintenance. We were keen to map the types of links authored in Web pages onto an established link taxonomy, such as [DeRose 1989], and see which types of links were most used in practice, and whether any patterns emerged. However, when we recently embarked on a study of the application of hypermedia in the Web today, we were surprised to discover, largely by inspection, just how much of the Web (99%) is simple navigation - both structured and ad-hoc, non-hierarchical "related information" links. There are very few examples of Web sites where there is evidence of associative linking from within the content sections of documents, i.e. inline linking. So rather than having many sites to analyse, we found ourselves asking the question - is the Web killing hypermedia?
In order to try to answer this question, we needed to establish a method of automatically analysing Web pages to determine how much inline linking they contained, if any, as well as what was being linked to what. The most common Web page authoring practice is to separate the content of the document from the navigational links, making a separate navigational section or independent frame. Whether they are personal home pages, commercial Web sites or educational texts, the common practice is to split the page into content regions and functional regions, with the information being displayed in the content region (with or without any inline/associative linking) and the navigation (next page, search, home, related links) in the easily distinguished functional regions at the top, bottom or left-hand margin of the page.
Of course not all hypermedia researchers would agree that linking from within the content section of a document is a good thing. Some hypermedia design models, such as HDM [Garzotto et al, 1995], focus on the connections between 'atomic pages' and discourage the use of inline linking so that the structure of the hypermedia is clearer to the reader and indeed the author. But generally, since the first use of the terms hypertext and hypermedia by Nelson in the mid-1960s, associative linking on the content of documents has been at the heart of hypermedia authoring, and is the essence of non-sequential writing. To quote from DeRose:
"A relational link which is entirely unpredictable is called an associate link. Such links are the usual stock in trade of hypertext systems."
From our observations of Web sites, it would appear that neither approach is adopted on any large scale. Most sites neither apply a rigorous hypermedia design model, nor show any attempt to create inline links. In this paper we describe how we attempted to analyse Web sites that employ associative, inline linking in order to develop criteria for "good linking practice" on the Web. In order to automate or semi-automate this process we needed to develop a method of decomposing Web sites in order to automatically identify the contents section of documents. The results give us a method of "reading" Web pages that can be applied to enable the pages to be presented in non-standard and reduced-bandwidth Web browsing applications. Finally we discuss why there is so little "hypermedia" in the Web today and speculate on if this might change in the future.
Web Connectivity Analysis [Heylighen 1999] is the study of the use of linking on the WWW. In this discipline links have been used to categorise individual pages [Boyan 1996; Brin & Page 1998], to help segment a cluster of pages [Salton et al. 1996] or to analyse the Web as a whole [Raghavan et al. 1999]. For the purposes of this paper however, we are interested in the use of links within each page.
In order to identify sites with "good linking practice" it is necessary to decide on the criteria for "good linking". Assuming it is not something which can only be recognised when seen, but something that has identifiable and measurable qualities, then it ought to be possible to construct a Web search query that will return examples of good linking.
The only comment about linking in the Yale Manual of Web Style [Lynch and Horton 1997] is "The relatively plain, mostly text-based home page for the W3C offers a very efficient ratio of links per kilobyte of page size, but at some cost in pure visual appeal". Perhaps therefore good linking style is to be found in a high ratio of links to content, although this would tend to favour simple bookmark lists. More generally the manual only covers the use of hypertext links as a mechanism for localised hierarchical decomposition of the site's structure and not as a mechanism for content-rooted cross-referencing.
Even if a simple metric for "good hypertext" can be determined, existing search engines are of little help. For example, although AltaVista allows searching for pages containing individual links with specific anchor contents or destinations, it does not allow the aggregate link features of a page to be queried (e.g. to find any page with > n links, or pages where link density = x). Depending upon the page design, linking can either be integrated with the content of a Web page, occurring directly on the terms and phrases of interest within the flow of the argument, or else it can be kept as a separate activity that occurs in parallel frames or endnotes. Whether the layout is integrated or separated, the purpose of the links can be either to navigate around the site (site navigation) or to follow up specific points and issues raised on the current page (subject navigation).
Due to the initial lack of metrics for these kinds of linking semantics, combined with the absence of link querying facilities, we canvassed for recommendations among colleagues in our research group and from readers of "Hypertext Kitchen" (http://hypertext.pair.com/, a web site for hypertext writers and researchers). From our enquiries we discovered a small number of scientific and technical sites with interesting linking strategies and rather more in the hyperliterature community. Of these we chose to focus our analysis on the scientific/technical sites in order to work on some broad statistics that described each site's linking practice.
For the analyses we attempted to discover sites which integrated linking into the content regions as well as providing navigation in functional regions. In fact it is frequently difficult to make a clear distinction: "web log style" uses embedded links in short 'news' paragraphs (e.g. SlashDot http://www.slashdot.org/ and Scripting News http://www.scripting.org/) but it becomes difficult to tell whether the links annotate the news items or the sentence-long news items simply annotate the links.
A number of news sites (e.g. Wired, C-Net) only provide links within their stories on the names of commercial or institutional bodies that they mention, whereas the Alertbox site (http://www.useit.com/) links into the body of its current bulletin any relevant previous issues.
NASA's Astronomy Picture of the Day (http://antwrp.gsfc.nasa.gov/apod/) links into each day's text not just to relevant information from previous days, but also external educational and scientific Web pages which explain or illustrate any key phrases and technical terms used in the text. Scientific American (http://www.sciam.com/) provides a similar service for its "Enhanced Articles", also providing more general related article links in the page's navigation section. Although both these sites share a similar brief on the public understanding of science, Scientific American's British counterpart New Scientist (http://www.newscientist.com/) provides no within-text links, only separate navigation functions and lists of related articles. The same applies for the online version of National Geographic (http://www.nationalgeographic.com/).
The deliberate and innovative use of integrated hypertext is emerging in literary and academic writing; the novel 253 (http://www.ryman-novel.com/) and the essay E-Literacies (http://raven.ubalt.edu/staff/kaplan/lit/One_Beginning_417.html) are exemplars.
We can make use of various hypertext measures in an attempt to understand what people are linking and why. The intuitive "link density" measure allows us to determine how much linking exists in each page or across a site. We can check to see what is being linked on, what is being linked to and whether each site's linking is thorough (each linking opportunity is taken) and consistent (each linking opportunity is taken every time).
Other more elaborate measures can be taken. The dominance and connectedness [Jackson 1997] of a site indicate whether links are spread out fairly to all the possible destinations and the degree to which everything is linked to everything else. A balanced hypertext structure is characterised by high connectedness and low dominance. Pages can be classified as hubs or authorities [Kleinberg 1998], according to whether they contain lots of links to material about a topic or whether they contain a lot of material about a topic.
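To illustrate the hub/authority distinction, Kleinberg's scores can be approximated by a few rounds of power iteration over a simple page-to-destinations map. This is an illustrative sketch of the idea rather than the exact procedure used in [Kleinberg 1998]:

```python
def hits(links, iterations=20):
    """Approximate Kleinberg hub/authority scores by power iteration.

    links: dict mapping each page to the set of pages it links to."""
    pages = set(links) | {d for ds in links.values() for d in ds}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the summed hub score of pages linking to it.
        for p in pages:
            auth[p] = sum(hub[q] for q, ds in links.items() if p in ds)
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        for p in pages:
            auth[p] /= norm
        # A page's hub score is the summed authority of its link targets.
        for p in pages:
            hub[p] = sum(auth[d] for d in links.get(p, ()))
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        for p in pages:
            hub[p] /= norm
    return hub, auth
```

A page that mostly points outward (an index) scores highly as a hub, while its frequently-cited targets score highly as authorities.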
Links offer the reader choice between navigational options. The amount of choice offered by each Web page can be expressed as the ratio of links to pages: if it is close to one, the site offers the minimal degree of choice for the reader; it is a sequence of pages linked only with a "Next" anchor where the reader can only go forward one step at a time [Dillon 1999].
Some of these measures mainly make sense when analysing a closed hypertext environment, others focus on the construction of a graph structure from a set of nodes. The intent of this paper by contrast is to try to determine the individual decisions and processes that authors make in creating hypertext links, consequently we focussed on the APOD site (1500 pages), the 'enhanced articles' subset of the SCIAM site (100 pages) and a random selection of feature articles from the New Scientist site (25 articles). The latter are only linked separately and for navigation purposes: it is included as a contrast to the other sites.
Figure 1a: Links per APOD article (Average: 32; Std Dev: 7; Number of articles: 1500; Link density: 55%; Destination ratio: 56%)
Figure 1b: Links per SCIAM article (Average: 52; Std Dev: 16.28; Number of articles: 100; Link density: 40%; Destination ratio: 53%)
Figure 1c: Links per NewSci article (Average: 25; Std Dev: 0.69; Number of articles: 25; Link density: 26%; Destination ratio: 71%)
To determine a very crude measure of how much linking is occurring, we simply counted the number of links in each article and plotted them over time (see figures 1a, 1b and 1c). The APOD graph clearly shows the way that the amount of linking increased as the site developed over the first issues; a similar trend is (just) visible in the SCIAM graph. To take into account the varying size of the pages, the average link density was recorded as the number of bytes implementing links divided by the number of bytes in the page.
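A measure of this kind can be computed directly from a page's raw HTML. The function below is an illustrative reconstruction, not the script used to produce the figures; it reads "bytes implementing links" as the whole <a ...>...</a> span, markup included, which is one plausible interpretation:

```python
import re

# Match a whole anchor element, tags included (assumes attributes on <a>).
ANCHOR = re.compile(r"<a\s[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)

def link_stats(html: str):
    """Return (link count, link density) for one page.

    Density = bytes spent implementing links / total bytes in the page."""
    spans = ANCHOR.findall(html)
    link_bytes = sum(len(s) for s in spans)
    return len(spans), link_bytes / max(len(html), 1)
```

Running this over every article in a collection and plotting the counts against publication date reproduces the kind of trend graph shown in figures 1a-1c.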
Although 'average numbers of links' and 'average link densities' are recorded, we also need to distinguish between links that offer structural navigation (home or next page) versus associative links which are rooted in the content domain. Information about the link's destination may give some clues as to the purpose: on-site links may well be structural and off-site links are highly unlikely to be so. The destination ratio is the proportion of links whose destination is outside the current site. This statistic turns out to be highly misleading: visual inspection shows that many 'content links' on the APOD and SCIAM sites are to previous local articles, and that the NewSci site consists of a number of URL trees.
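The destination ratio can be computed by resolving each href against the page's own URL and comparing hosts. The sketch below makes the simplifying assumption that a shared host name means "on-site":

```python
from urllib.parse import urlparse, urljoin

def destination_ratio(hrefs, base_url):
    """Fraction of links whose destination lies outside the current site.

    Relative links resolve against base_url and so count as on-site."""
    if not hrefs:
        return 0.0
    base_host = urlparse(base_url).netloc
    offsite = sum(
        1 for h in hrefs
        if urlparse(urljoin(base_url, h)).netloc != base_host
    )
    return offsite / len(hrefs)
```

As noted above, a same-host comparison cannot distinguish a structural "next page" link from a content link to a previous local article, which is exactly why the statistic proved misleading.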
Plotting a graphical representation (figures 2a and 2b) of the byte offset of each link in the APOD site shows that there seem to be two clear bands of unchanging links: these are the navigation links seen at the top and bottom of each Web page.
Figure 2a: Clustering of link offsets from the beginning of APOD pages
Figure 2b: Clustering of link offsets from the end of APOD pages
An alternative approach was taken to determine the purpose of each link, based on the assumption that within a particular web page, navigation links are to be found in 'navigation sections' and 'content links' are to be found in 'content regions'. The following sections deal with attempts to interpret the Web pages so that we can determine the purpose of each region of the page and hence the purpose of each link.
Automatic decomposition of documents such as web pages into logical sections has been the topic of much research, but researchers seem to be focusing on the decomposition of structured or at least semi-structured documents, for example a web page containing a number of distinct records, such as that returned from a web database query.
Early research in this area enabled documents to be decomposed using manual [Atzeni and Mecca 1997; Hammer et al 1997] or semi-automatic [Adelberg 1998; Ashish and Knoblock 1997; Doorenbos, Etzioni, and Weld 1997; Kushmerick, Weld, and Doorenbos 1997; Soderland 1997] boundary discovery, followed by automatic record extraction based on the discovered record boundary (the record boundary usually takes the form of a single HTML tag, or sequence of HTML tags, that recurs at the boundary between each record). Embley et al. [1999] achieve fully automatic decomposition of structured web pages, using a heuristic algorithm to analyse a document's structure and determine the boundaries which separate records within it.
Although our data set was not as large as those used by other hypertext researchers to analyse hypertext documents [e.g. Woodruff et al. 1996], it was clear that an automatic decomposition algorithm was necessary. However, since the documents in our data set were largely unstructured in the sense that they did not conform to the "records" architecture, it was difficult to envision the existence of a single "section boundary" which could be used to segment the documents.
Therefore, segmentation of the documents in our data set was carried out using a visual rather than a structural strategy. When an author creates an unstructured hypertext document, it is often from a visual rather than a structural perspective, and structural markup (e.g. headings and tables) may be used and misused in order to create the desired visual effect.
The most obvious visual clues in segmenting a document are section headings: small chunks of text that denote the start of a new section and the end of the preceding one. As observed by Wynblatt et al in their attempts to generate an accurate "table of contents" for hypermedia documents based on section headings within the document (Wynblatt, Benson, and Hsu 1997), although the HTML specification provides the <Hn> series of markup tags, many document authors find these tags too abstract and prefer to exercise more control over the visual appearance of their section headings. Examination of the documents in our data set confirmed this observation, and the first stage in our decomposition was therefore to identify heuristically all the section headings in each web page. Horizontal lines, marked up with the <HR> tag, also provided an obvious and more easily detected section boundary.
Many HTML documents contain sub-regions without section titles that are instead implied visually [Wynblatt and Goose 1998]. Structural features such as cell backgrounds, blank spaces and margins alert the visual user to conceptually separate items on a web page [Wynblatt et al. 1997]. Further scans of the documents in our data set therefore used structural clues such as these in order to derive further section boundaries. This automatic analysis produced a "skeleton" representation of the sections and links within the document – an intermediate form that could be examined by other processes in order to derive statistical observations about how links are being used within a particular document set.
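A minimal version of this boundary scan can be written as a single regular-expression pass. The patterns below (explicit headings, <HR> rules, and short bold runs followed by a line break) are assumed heuristics standing in for the fuller rule set described above:

```python
import re

# Markup treated as a visual section boundary: explicit <Hn> headings,
# horizontal rules, and (heuristically) a short bold line used as a title.
BOUNDARY = re.compile(
    r"<h[1-6][^>]*>|<hr[^>]*>|<b>[^<]{1,60}</b>\s*<br",
    re.IGNORECASE)

def segment(html: str):
    """Split a page at heuristic visual boundaries, returning the
    'skeleton': an ordered list of raw HTML chunks (sections)."""
    cuts = [m.start() for m in BOUNDARY.finditer(html)]
    edges = [0] + cuts + [len(html)]
    return [html[a:b] for a, b in zip(edges, edges[1:]) if html[a:b].strip()]
```

Each returned chunk can then be examined independently, e.g. to measure its link density or extract its links.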
In order to determine the nature of each identified section within a document a link density measure was calculated for each section. The link density of a section can be used to derive whether the section is a content or functional section [Wynblatt and Benson, 1997]. Functional sections usually have a high link density as opposed to the lower link density of content sections. The best method for measuring the link density of a document is to calculate the number of characters in the document which are contained within link anchors as a ratio of the total number of characters in the document. Other techniques for determining the link density, such as simply counting the absolute number of links, or counting the number of links per unit of text (for example per kilobyte of text) may mischaracterize documents (and hence sections) which are mainly content but have been diligently linked to related pages (such as those documents which we set out to investigate).
Wynblatt and Benson also point out that when calculating a link density measure, links anchored solely on a bitmap with no accompanying/alternative text should be counted as C characters towards both the number of characters within anchors for a particular section and the number of characters in the section. It follows logically that an image map containing N links should contribute N * C characters to both totals. Applying this reasoning to our data set enabled sections containing only an imagemap to be correctly identified as functional sections.
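Putting the character-based density measure and the image rule together, a section classifier might look like the sketch below. The values chosen for C and for the functional/content threshold are assumptions (both are tunable), and imagemap <area> links are not handled here:

```python
import re

C = 20           # assumed character weight for an image-only link
THRESHOLD = 0.5  # assumed density above which a section is 'functional'

A_BODY = re.compile(r"<a\s[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
TAGS = re.compile(r"<[^>]+>")

def classify(section_html: str) -> str:
    """Label a section 'functional' or 'content' by anchor-character density."""
    anchor_chars = 0
    image_only = 0
    for body in A_BODY.findall(section_html):
        text = TAGS.sub("", body).strip()
        if text:
            anchor_chars += len(text)
        else:
            # Image-only anchor: contributes C chars to both counts.
            anchor_chars += C
            image_only += 1
    total = len(TAGS.sub("", section_html)) + image_only * C
    density = anchor_chars / max(total, 1)
    return "functional" if density > THRESHOLD else "content"
```

A navigation bar (nearly all of its text inside anchors) lands well above the threshold, while a diligently linked paragraph of prose stays below it.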
Figure 3a: Sections per APOD article (Average content sections: 1.34; Average navigation sections: 1.9)
Figure 3b: Sections per SCIAM article (Average content sections: 8.2; Average navigation sections: 5.5)
Figure 3c: Sections per NewSci article (Average content sections: 2; Average navigation sections: 7)
The results of the use of this algorithm for dividing the previous collections of pages into sections are shown above. It successfully divides the APOD articles into the two navigational regions (top and bottom) and the single middle content region. The SCIAM pages are more complex from both a visual and markup point of view, but the decomposition is good. The results for the NewSci site however are very sensitive to the link density threshold, and can swap to give 8 content sections and only 1 navigation section.
Within each identified content section we can examine the links to determine the link anchors, i.e. the content that has been linked on. We can then investigate issues of consistency: given that a keyword appears in a link, is it always linked and is it always linked to the same place?
    Hubble space telescope    266    200    75    14
    Diabetes                   25      4    16     3
The above table is a selection of the most frequent keywords based on the APOD and SCIAM pages as they exhibit significant linking of the content section. It shows clearly that neither site is consistent in whether or not individual terms are always linked nor where they are linked to. We investigated whether the inconsistency in the existence of links was due to appropriate destinations becoming available: before a particular time the keywords may be unlinked and after that time the keywords might consistently be linked. This was demonstrated not to be the case.
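A consistency check of this kind can be expressed as a small tally over the extracted pages. The (text, links) pair structure here is a hypothetical simplification of our skeleton representation:

```python
def consistency(pages, keyword):
    """For one keyword, tally its occurrences, its linked occurrences
    and the distinct destinations it is linked to.

    pages: iterable of (text, {anchor_text: href}) pairs."""
    occurrences = linked = 0
    destinations = set()
    kw = keyword.lower()
    for text, links in pages:
        occurrences += text.lower().count(kw)
        for anchor, href in links.items():
            if kw in anchor.lower():
                linked += 1
                destinations.add(href)
    return occurrences, linked, sorted(destinations)
```

A term that is always linked, and always to the same place, would show occurrences equal to the linked count and a single destination; the table above shows that neither site comes close to this.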
Sites that link separately and internally to related items and previous issues require only a simple metadata match to ensure that the general links provided are relevant. Sites that indulge in integrated content linking require both author input to create suitable links in the first place and editorial input to ensure consistency of approach over time. As an online (linked) version of a printed text, the SCIAM enhanced articles go through separate author and editorial processes. APOD pages, by contrast, are not written by independent authors but by the editors of the site. Also by contrast, each APOD page is written explicitly to be linked: the textual content is constructed to function as an abstract with links providing all the detailed or background information required by the various readership profiles. As such, the task of writing the hypertext is seen by the authors as easier than that of writing a plain text. This is because a plain text must express every idea and elaboration that is necessary to the understanding of the subject, whereas a hypertext can be written as a skeleton of the necessary information with links being used to add 'flesh' to the subject.
Perhaps the fact that this kind of linking process (or more simply "writing process") is so at odds with people's normal experience of literacy explains why it is not more commonplace. The effort required to locate high-quality potential Web pages to link to may also be an issue here: competent editorial experience and a knowledge of the kinds of material available in a particular subject domain are key to this kind of linking, and these are skills which the average Web site creator does not possess.
In trying to discover how Web page authors write hypermedia, we found ourselves having to invent algorithms and heuristics to read hypermedia. These algorithms can be used in Web applications which are not based on standard workstation environments: low-bandwidth hypertext environments, Web telephony applications and information presentation for the visually impaired. This section describes the access problems of typical mobile clients on the WWW, and how an "automatic Web page reader" can help present Web pages for these constrained user interfaces.
Much of the information available on the Internet is designed with large-bandwidth clients in mind, yet a large percentage of WWW clients are run over lower-speed 33.6K or 56K modems, and users are increasingly considering wireless services which run at even lower bandwidth. Current mainstream browser technology presents web pages on mobile and PDA devices as it would on a conventional sized display, without regard for the smaller display size. Although the majority of research in this area suggests that users will not be severely restricted when attempting to comprehend text on a smaller display [Jones et al. 1999; Duchnicky and Kolers 1983], research relating to interacting with text at a higher level than reading (such as the interaction with hypertext) suggests that difficulties do arise [Jones et al 1999; Shneiderman 1987].
The requirements of explicitly managing a smaller display and reduced bandwidth have prompted a number of solutions. Either a dedicated proxy service or an enhanced Web server can rewrite HTML pages and images into a form more suitable for the constraints of the browser [Gessler and Kotulla 1994; Fox and Brewer 1996, 3Com 1999 ]. New markup standards can be used in conjunction with HTML (or as a replacement for it) to explicitly express how Web pages should be rendered and displayed [SpyGlass]. Others address the increased difficulties of hypertext interaction by providing more information about the effect of following a link [Zellweger 1998; Stanyer and Procter 1999; Kopetsky and Mühlhäuser 1999].
Current solutions for the intrinsic problems created by low-bandwidth access to hypermedia resources with a restricted visual display focus on new protocols or markup schemas. WWW protocols assume users will interact with hypertext on a document-at-a-time basis, and this can be frustrating if the document required is particularly large or graphically intense.
A solution that we are pursuing is the use of our segmentation algorithm in order to make web documents available on a section-by-section basis, enabling low-bandwidth users to retrieve the exact information that they require without having to retrieve the surrounding information in the document. The key principle of this approach is similar to that of the "web clipping" strategy implemented by 3Com's Palm Computing on their latest PDA devices, but with a major difference. Web clipping works by offering the user access to a number of services offered by different sites (such as search and directory services, or online banking). For each of these sites, a specialised application is constructed which receives the user's query (such as keywords to search for, or bank account details), accesses the site over a high-bandwidth connection and extracts the minimal required information based on assumed and agreed semantic rules (i.e. the application knows the form that the returned web page will take, and which parts it should extract). This limits the user's access in that the choice of how and which sites to access lies with the developer and the service providers, rather than with the user. We hope to be able to put this decision in the hands of the user, by fetching a requested WWW page over a high-bandwidth connection, segmenting it using our generic algorithm, informing the user of the available sections, and returning the sections that the user finds interesting to the PDA device on a section-at-a-time basis (each section being "wrapped" into a single HTML document).
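The final "wrapping" step can be as simple as emitting the extracted section inside a fresh HTML shell. Adding a <base> element (an implementation choice of this sketch, not part of the Palm scheme) keeps the section's relative links resolvable on the PDA:

```python
def wrap_section(section_html: str, title: str, base_url: str) -> str:
    """Wrap one extracted section as a self-contained HTML page.

    The <base> element makes relative links inside the section
    resolve against the original page's URL."""
    return (
        "<html><head>"
        f"<title>{title}</title>"
        f'<base href="{base_url}">'
        "</head><body>"
        f"{section_html}"
        "</body></html>"
    )
```

The proxy would call this once per requested section, returning each wrapped page to the device individually.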
Our current prototype uses a small thumbnail of a website to inform the user of the general layout of the site and the available sections – this thumbnail image is generated by a web proxy with a high-bandwidth connection which acts on behalf of the user by retrieving and segmenting documents. The user is able to interact with the thumbnail image – moving the mouse pointer over a particular region of the image that corresponds to a section in the actual document causes information about the section to be displayed in the form of a "tooltip" (see Figure 2). The user is able to request a section of the web page by clicking on the appropriate area of the image. Simple "next" and "previous" options facilitate browsing on a casual section-by-section basis.
After requesting a web page, the proxy server returns a thumbnail image of the page, which is displayed in the left-hand frame of the browser. Larger pages may be represented by a series of thumbnails. The user is then able to click on each region of the image in order to retrieve the corresponding section (displayed in the right-hand frame).
Moving the mouse pointer over the image reveals information about each section in the webpage. This information currently consists of the section title, type (content or navigation) and an estimation of the total size of all the elements inside the section (including the text itself). We also propose to add a visual indication of the extent of the section using a simple bounding box.
Future enhancements to this prototype include allowing the user to specify the granularity of the segmentation of a particular web page or section. For example, consider a section which is denoted visually by a section heading. This section may contain many paragraphs, and hence constitute considerable bandwidth and display requirements. By allowing the user to increase the granularity of the segmentation of this particular section, paragraphs can be quickly retrieved individually as required.
To combat the loss of text quality in producing a thumbnail image of a web page, we are also investigating the possibility of including a "keyword search" option to enable users to determine which section of the requested web page is most likely to best meet their needs. After entering one or more keywords, the sections best matching the query would be highlighted.
The issues that we tackle with this prototype about how a Web page uses its visual style to communicate its material are answers to the questions "Where is the content?" and "Where is the navigation section?" Having dealt with those we can start to address questions like "Is the content well-linked, or is navigation only handled separately?" That received wisdom is to carefully design separate navigation structures and to be wary of the use of linking [Rosenfeld and Morville 1998] suggests that the prevailing practice might be to avoid inline linking. Hypertext design research that focuses on the connections between 'atomic' pages also ignores the use of inline linking [Garzotto et al 1995]. However, these results should now allow us to describe well-linked pages in terms of patterns of links and to construct a Web search engine that will be able to perform queries based on those patterns.
Having analysed some relatively simple Web sites using this approach as we described earlier in the paper, we have now moved onto analysing more complex sites, such as the BBC Web site shown above. We have demonstrated that we can automatically distinguish between the content sections and infrastructure/navigation sections of a Web site, which enables us to identify the level of associative/inline linking used within the contents sections of the site. Once we have identified more sites that utilise associative/inline linking we will be in a better position to analyse linking styles in Web authoring practice and establish metrics for determining good and bad linking practice. For example, how important is consistency, or link density?
Another interesting application would be to analyse the pattern of citations in electronic journal papers. The reference to the citation in the text of an article links to the full citation at the end of the paper, and the full citation might link to the actual cited article itself. In DeRose’s taxonomy [DeRose, 1989], such links are called intensional, vocative links because they invoke a particular document or document element (in this case a citation) by name.
The hypermedia research community has always maintained that link authoring and maintenance is hard. Common practice today in the Web seems to confirm this, and since there is little or no support inherently in the Web for hypermedia authoring, it is hardly surprising that except in a few instances there is little of it in evidence. The open hypermedia community would argue that links should be reasoned about as first-class objects and stored in link databases to enable large-scale associative linking. New Web standards such as XML and XLink reflect this thinking, and if they become widely used this may encourage the wider application of associative linking in the creation of large-scale Web sites. The increased use of metadata through standards such as RDF might also encourage this, since metadata tags can be used as a means of implicitly linking items of information. The application of software engineering metrics to hypermedia authoring might also lead to a better understanding of what makes link creation, maintenance and reuse easier and more sustainable over time; see for example [Mendes et al, 1998]. But our overall conclusion at the end of this paper has to be a rather depressing one. At the moment, there is very little real hypermedia on the Web except that created by authors of hypertext literature and some well edited sites. There is no reason why this has to be so, but it requires a concerted effort on behalf of the Web standards community to ensure that the Web evolves to support hypertext in all its richness.
Thanks to Robert Nemiroff, co-producer of APOD, for insight into APOD's editorial and linking policies and processes. Thanks also to Mark Bernstein and readers of the Hypertext Kitchen for examples of Web sites displaying good hypertext practices.