David DeRoure and Steven Blackburn
{dder | sgb97r}@ecs.soton.ac.uk
Multimedia Research Group
Department of Electronics and Computer Science
University of Southampton
Southampton SO17 1BJ, UK
+44 (0)1703 592418
We have developed a set of component-based navigational hypermedia tools in order to investigate the application of open hypermedia to temporal media. Our goal is to identify issues which may influence the Open Hypermedia Protocol (OHP) design, since much of the experience in the open hypermedia community has been with text and images. The tools focus on audio as a case study and include support for content based navigation, specifically in music. This paper describes the tools and our early experiences of using them, and raises some discussion points for the OHP design.
The open hypermedia approach is well established but there is little experience of its application to temporal media. We have built a set of prototype tools to demonstrate and explore the issues involved in applying the principles of open hypermedia to audio, and we have extended our treatment to include content based navigation for musical content. Although our investigation concentrates on audio, many of the techniques are intended to be more general and we believe that the issues raised are relevant to other temporal media. The prototype tools were demonstrated at ACM Multimedia 1997 in Seattle.
We have adopted a component-based approach to the design and implementation of the tools as this allows individual components to be replaced with alternative implementations. The system is easily extended to interoperate with other open hypermedia systems by writing a component which communicates with the other system. The system is designed to be compliant with the Open Hypermedia Protocol (OHP) and we are tracking the development of this protocol; our design also addresses interoperability with an existing system, the Distributed Link Service (DLS) [Carr; DeRoure]. We also draw on earlier work on hypermedia and multimedia [Hardman] and on the MAVIS system [Lewis] for our approach to content based navigation.
We call our project Amphion: according to Greek mythology, Amphion disliked the rougher life of his brother, a hunter and warrior. He preferred to accomplish by music what others did by force, easily moving rocks which his brother could not manage by physical means.
The design of our tools has been motivated by a series of scenarios used to suggest requirements; the results of this exercise are discussed in the following section. This is followed by an overview of the tools, an example of the tools in use and a discussion of experiences and issues for OHP. We assume that the reader is acquainted with the principles of open hypermedia and the OHP design exercise.
The design of our tools has been motivated by a number of scenarios which have been identified through discussions with users. We use scenarios in the same way as the open hypermedia working group [OHSWG], where a real world problem is used to identify the requirements of a system which implements a solution.
Perhaps the most familiar opportunity for links within audio ('branching audio') or video occurs in structured presentations, e.g. lectures, documentaries or meetings. For example, these typically commence with an overview and, when playing back a recording (perhaps as a stream from a remote server), links can be made available from within the overview to the corresponding sections of the presentation. The user interface for the display of, and selection from, the available links is likely to be platform dependent; e.g. it would differ between a workstation, a television, a car radio and a wearable computing device. Of course, it is also useful to link out from audio to related resources and indeed in from related resources.
As a case study, we are assisting historical researchers working with speeches, such as those of Winston Churchill. Here the particular requirements are a close coupling between the digital audio and the text transcriptions, linking to and from commentaries and finding occurrences of similar phrases by content based navigation based on text. We are also exploring related scenarios through recordings of meetings held using conferencing facilities such as the MBONE tools, providing a direct source of structured multimedia content which users wish to navigate. In general we believe structural information should be recorded at source and during production wherever possible, rather than obtaining it retrospectively by analysis (as we must do with historical archives).
Some of our other scenarios involve musical content. Interaction with musical structure usually occurs in the composition, production and publishing of music and is perhaps less familiar to many end users, although the techniques we propose make this a possibility. Where multiple versions of a performance are required, our approach is to treat each version as an alternative view on the performance; i.e. structure is 'first class'. Content based navigation has a role in facilitating the authoring of these structures, and plays a part in delivery for use by musical researchers and educators; for example, we envisage an interactive tutor which can find similar phrases or chord sequences and their associated links.
The scenarios raise a number of research issues. Presentations, such as a radio documentary, can be thought of as a guided tour through a branching structure (perhaps customized to the listener) which, by default, is 'pushed' as a stream; we propose that the user could interact in order to follow different routes. A close coupling is required between different media, for example audio and transcript or audio and MIDI. Together with structural and external links, the associations need to be preserved during conversion between media types (e.g. synthesis) and format conversion within a type (e.g. down sampling and compression). Audio information delivered as a stream may be transient. In general, we envisage the same multimedia content being delivered in very different ways in different situations.
The system architecture is shown in figure 1, and is followed by a description of the key components. An endpoint identifies the document and a position or duration within the document. Endpoints are communicated between the components using a very simple message which is compliant with URL syntax. Links consist of multiple source and destination endpoints; endpoints in links can carry alternative representations (e.g. text, thumbnails) so that it is possible to navigate through the hyperstructure without accessing the temporal media at all. These formats were adopted for simplicity during development and we intend that they will in due course be adapted for interoperability with SMIL, XLink and XPointer (see the W3C Web site for current versions of these recommendations).
Figure 1 - A diagram showing the system architecture.
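The exact syntax of the endpoint messages is not reproduced here. As a rough, purely illustrative sketch, the following encodes an endpoint (document plus position or duration) as a URL-style query string; the scheme and parameter names (doc, start, duration) are our own shorthand for this sketch rather than the actual message fields.

    # Hypothetical sketch of a URL-syntax endpoint message; the scheme and
    # parameter names are illustrative, not the format used by the tools.
    from urllib.parse import urlencode, parse_qs

    def make_endpoint(doc, start_ms, duration_ms=0):
        """Encode a document plus a position/duration as a URL-style message."""
        query = urlencode({"doc": doc, "start": start_ms, "duration": duration_ms})
        return "endpoint:?" + query

    def parse_endpoint(message):
        """Recover (doc, start, duration) from an endpoint message."""
        fields = parse_qs(message.split("?", 1)[1])
        return (fields["doc"][0],
                int(fields["start"][0]),
                int(fields.get("duration", ["0"])[0]))

    # e.g. a 5-second selection starting 90 seconds into a recorded speech
    msg = make_endpoint("http://example.org/speeches/churchill.wav", 90_000, 5_000)
    print(parse_endpoint(msg))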
Central to our toolkit is the Link Manager. This component receives messages from the other tools and resolves them using a link service (e.g. a local linkbase, or the DLS), presenting available links to the user via an appropriate interface component. Typically the input message contains information about the playback position in a particular piece of temporal media, or about a duration within the media. The document and duration may themselves be matched against source endpoints in the link resolution process, or for content based navigation a feature can be extracted from the content of the selection (i.e. generic links for multimedia content). Content based navigation is discussed further in the next section.
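For illustration, the sketch below captures the essence of this resolution step: an incoming endpoint is matched against the source endpoints of each link in a local linkbase by document identity and overlapping time range. The data structures shown are simplified stand-ins, not the DLS or linkbase formats themselves.

    # Minimal sketch of link resolution against a local linkbase; the structures
    # and field names here are hypothetical, not the DLS or linkbase format.
    from dataclasses import dataclass

    @dataclass
    class Endpoint:
        doc: str        # document identifier (URL)
        start: int      # offset into the media, in milliseconds
        duration: int   # 0 for a point, >0 for a span

    @dataclass
    class Link:
        sources: list       # list of Endpoint
        destinations: list  # list of Endpoint (may carry text or thumbnail labels)

    def overlaps(a: Endpoint, b: Endpoint) -> bool:
        """True when two endpoints name the same document and their spans overlap."""
        if a.doc != b.doc:
            return False
        a_end = a.start + max(a.duration, 1)
        b_end = b.start + max(b.duration, 1)
        return a.start < b_end and b.start < a_end

    def resolve(query: Endpoint, linkbase: list) -> list:
        """Return destination endpoints of every link whose source overlaps the query."""
        hits = []
        for link in linkbase:
            if any(overlaps(query, src) for src in link.sources):
                hits.extend(link.destinations)
        return hits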
The Link Player is an example of a 'link-enabled application' and is a general purpose media player, essentially a wrapper for the OS multimedia capabilities. It plays a single multimedia file and can send location and duration information to the Link Manager. A message is sent as a result of user interaction, by pressing the link button, or automatically, e.g. at regular intervals or, in principle, at predefined points.
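A sketch of the sending side, using the illustrative message encoding above; the transport shown (a simple print) and the polling interval are stand-ins for the actual inter-component mechanism.

    # Sketch of a player emitting endpoint messages on the link button or at
    # regular intervals; transport and field names are illustrative only.
    import time
    from urllib.parse import urlencode

    def send_to_link_manager(message: str) -> None:
        print("->", message)  # stand-in for the inter-component transport

    def endpoint_message(doc: str, position_ms: int, duration_ms: int = 0) -> str:
        return "endpoint:?" + urlencode({"doc": doc, "start": position_ms,
                                         "duration": duration_ms})

    def on_link_button(doc: str, position_ms: int, selection_ms: int = 0) -> None:
        """Called when the user presses the link button."""
        send_to_link_manager(endpoint_message(doc, position_ms, selection_ms))

    def announce_periodically(doc: str, get_position_ms, interval_s: float = 2.0,
                              announcements: int = 3) -> None:
        """Automatically announce the playback position at regular intervals."""
        for _ in range(announcements):  # bounded here only so the sketch terminates
            send_to_link_manager(endpoint_message(doc, get_position_ms()))
            time.sleep(interval_s)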
While the Link Player deals with one multimedia document, another link-enabled application, the Sequence Player plays a linear sequence of media fragments, possibly from different documents, according to a simple description (list of fragments). This tool was created in order to deliver a synopsis of existing presentations, and is an example of a multimedia presentation application controlled by a session description language. We anticipate that players for SMIL documents, e.g. GRiNS [Bulterman], will be used in this way.
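For illustration, a fragment list and a sequential player might look like the following sketch; the tuple format and the playback stand-in are simplified substitutes for the actual description syntax and the OS player.

    # Sketch of a fragment-list description and sequential playback; the list
    # format and play_fragment stand-in are illustrative, not the actual syntax.
    synopsis = [
        # (document, start in ms, duration in ms)
        ("lecture1.wav", 0,       30_000),   # opening overview
        ("lecture1.wav", 900_000, 20_000),   # key result
        ("lecture2.wav", 60_000,  15_000),   # related discussion
    ]

    def play_fragment(doc: str, start_ms: int, duration_ms: int) -> None:
        print(f"playing {doc} from {start_ms} ms for {duration_ms} ms")

    def play_sequence(fragments) -> None:
        """Play each fragment in order, possibly drawn from different documents."""
        for doc, start, duration in fragments:
            play_fragment(doc, start, duration)

    play_sequence(synopsis)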
One of the challenges of working with temporal media is that the data may be served from a dedicated streaming device, in contrast to text and image documents which are typically transported using a store-and-forward model. Furthermore, a user might join a session in mid-stream (analogous to tuning in to a radio or TV station) and leave at any time. For streaming work we have adopted the Real Time Streaming Protocol (RTSP) [RTSP] and have constructed an RTSP server which can also describe and deliver streams of links. Our current interface to this is the Streaming Soundviewer, an evolution of the Microcosm SoundViewer [Goose].
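The RTSP exchange itself is not reproduced here; the sketch below only illustrates the notion of a stream of links, in which each link carries a presentation timestamp and is surfaced as playback reaches it, so that a client joining mid-stream simply starts from its join position. The message strings reuse the illustrative encoding above.

    # Illustrative sketch of a 'stream of links': timed link events surfaced as
    # playback reaches them. This is not the actual RTSP extension we use.
    timed_links = [
        (30_000,  "endpoint:?doc=transcript.html&start=0"),   # 30 s into the stream
        (125_000, "endpoint:?doc=agenda.html&start=0"),       # 2 min 5 s in
    ]

    def due_links(timed_links, last_pos_ms, pos_ms):
        """Return the links whose timestamps fall within (last_pos_ms, pos_ms]."""
        return [link for t, link in timed_links if last_pos_ms < t <= pos_ms]

    # e.g. the client polls the playback clock and asks which links became due
    print(due_links(timed_links, 100_000, 130_000))   # -> the agenda link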
Two other tools are not shown in the diagram above. The Link Hider embeds endpoint information directly within digital audio data. Although this appears to contravene the open hypermedia philosophy, there are situations where embedding is useful; e.g. where transport is restricted to one digital audio stream, and for editing with standard audio software. In fact embedding endpoint or link information in audio can be far less invasive than with text, as audio formats typically support multiple channels; the best example of this is perhaps MIDI, where 'link on...link off' information can be coded on a new MIDI channel as 'note on...note off' information.
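For example, using a MIDI library such as mido, link markers might be written to a reserved channel as follows; the channel number and the note-based encoding are illustrative rather than a fixed format.

    # Sketch of hiding 'link on ... link off' markers on a reserved MIDI channel
    # using the mido library; channel 15 and the note numbering are illustrative.
    import mido

    LINK_CHANNEL = 15  # a channel set aside for link markers, not audible notes

    def add_link_span(track, link_id, start_ticks, duration_ticks):
        """Append a link-on/link-off pair to a MIDI track, encoded as note events."""
        track.append(mido.Message('note_on', channel=LINK_CHANNEL,
                                  note=link_id, velocity=1, time=start_ticks))
        track.append(mido.Message('note_off', channel=LINK_CHANNEL,
                                  note=link_id, velocity=0, time=duration_ticks))

    mid = mido.MidiFile()
    track = mido.MidiTrack()
    mid.tracks.append(track)
    add_link_span(track, link_id=1, start_ticks=480, duration_ticks=960)
    mid.save('with_links.mid')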
Finally, to address issues of synchronization between multiple versions of the same document in different media (in this case audio and text) we developed the Audio Linker, an authoring tool which facilitates the creation of links between speeches and their transcripts. The tool is implemented as a special HTTP server and the interface is via a Web browser.
When linking on content rather than position, the position information is used to identify the content, and a feature is extracted from it. We have adopted this approach because the link-enabled application might not deal with the multimedia data itself (as is the case with the Link Player). However, it is sometimes useful to deal directly with content (e.g. with streams and link-unaware applications), although this may sacrifice the document context.
We have employed the MIDI format as a useful abstraction of musical performance for our investigation into content based navigation: it can be closely associated with a digital audio representation via time-stamps and position pointers. We can convert digital audio to MIDI by pitch tracking, for which there are good monophonic solutions, but the polyphonic case is difficult. From MIDI we can construct pitch contours and representations of chord sequences.
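A contour in the style of [Ghias; McNab] can be derived from MIDI note numbers with a few lines of code; the sketch below uses the common three-symbol alphabet (U for up, D for down, S for same).

    # Minimal sketch: reduce a sequence of MIDI note numbers to a pitch contour,
    # one symbol per interval (U = up, D = down, S = same).
    def pitch_contour(midi_notes):
        symbols = []
        for prev, cur in zip(midi_notes, midi_notes[1:]):
            if cur > prev:
                symbols.append('U')
            elif cur < prev:
                symbols.append('D')
            else:
                symbols.append('S')
        return ''.join(symbols)

    # e.g. the opening of "Frere Jacques" (C D E C, C D E C)
    print(pitch_contour([60, 62, 64, 60, 60, 62, 64, 60]))  # -> "UUDSUUD"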
Our first step towards content based navigation is the content based retrieval (CBR) tool, which performs matching of melodic features against a database of songs, providing both match position and relevance information. This functionality is broadly equivalent to what a Web search engine does for text, but with more information about the position of matches. Our prototype tool uses pitch contours [Ghias; McNab], whereby a melody is represented as a string of symbols from a small alphabet, and the Levenshtein string distance is used to identify close matches in a melody database. For content based navigation, a linkbase is searched which associates a pitch contour with relevant documents.
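The sketch below shows the matching step in its simplest form: a standard dynamic-programming Levenshtein distance and a linear scan over contour-keyed linkbase entries. Reporting match positions within longer melodies, as the retrieval tool does, would additionally require approximate substring matching, which is omitted here; the linkbase structure shown is hypothetical.

    # Sketch of contour matching: Levenshtein distance plus a linear scan over a
    # contour-keyed linkbase; the linkbase structure here is hypothetical.
    def levenshtein(a: str, b: str) -> int:
        """Minimum insertions, deletions and substitutions turning a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def best_matches(query_contour, contour_linkbase, max_distance=3):
        """Rank linkbase entries (contour -> documents) by edit distance to the query."""
        scored = [(levenshtein(query_contour, contour), docs)
                  for contour, docs in contour_linkbase.items()]
        return sorted((s, d) for s, d in scored if s <= max_distance)

    linkbase = {"UUDSUUD": ["http://example.org/frere-jacques.html"]}
    print(best_matches("UUDSUUD", linkbase))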
Figure 2 shows a screenshot of some of the tools in use, using resources available on the Internet. The Link Player is playing a digital audio file (a song), the user presses 'link', and the Link Manager consults a linkbase and shows the possible destination endpoints. One of the links is to a text transcription of the song, another is a contour which is used to find songs with matching contours using the CBR tool.
Figure 2 - A screenshot of the tools in use
Our first experiments with the prototype tools included production of branching presentations by editing linear recordings of lectures, linking a historical speech to its transcript and to external documents, production of alternative views of musical performances, and demonstration of content based retrieval and navigation using a database of some 8000 documents. We have included video in our demonstrations by linking with the audio track. These experiments have provided proof of concept.
The tools have evolved in response to user feedback. For example, we have introduced file history, to allow similar functionality to page history in a web browser, and a context menu on the link button in the Link Player to facilitate rapid selection of endpoint size for queries based on content.
The requirements for the feature matching algorithm vary according to the nature of the data and activity. For example, a search of the database using an error-prone query (e.g. a contour extracted from humming) requires a different parameterization of the matching algorithm to searching for a MIDI selection in a linkbase.
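As a rough illustration only, such parameterizations might be captured as presets passed to the matcher; the names and values below are hypothetical and simply show how the two cases could be tuned differently.

    # Illustrative matcher presets; names and values are hypothetical.
    MATCHER_PRESETS = {
        # Hummed queries are short and error-prone: be generous with edits.
        "hummed_query":   {"max_distance": 5, "substitution_cost": 1, "gap_cost": 1},
        # A MIDI selection is exact: only near-identical contours should match.
        "midi_selection": {"max_distance": 1, "substitution_cost": 2, "gap_cost": 2},
    }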
Whereas for text documents the open hypermedia approach must assert the advantages of not embedding 'links', this is less of an issue for audio documents, where there is no well-established format with embedded links. On the contrary, our work suggests that endpoints and links can perhaps be embedded in audio formats whilst remaining 'open', thanks to multichannel formats.
This work raises the following issues for the protocol between the link manager and link services, i.e. for OHP:
This paper has shown that, while OHP provides mechanisms for supporting temporal media and content based navigation, there are some areas which require clarification. Addressing the issues raised will improve interoperability among the OHP components and increase the chances of the protocol achieving wide acceptance.