S. T. Perry and P. H. Lewis The Multimedia Research Group Department of Electronics and Computer Science University of Southampton, SO17 1BJ, England.
Keywords: interactive segmentation, content based retrieval, content based navigation, extensible image viewer
Techniques for navigating through and retrieving text documents are widely understood and there are many tools that support their use. By contrast, the extension to non-text information such as image, video, and sound is poorly supported, because the fundamental operations of selection and matching are not as well defined as they are for text. Unlike textual media, where an object of interest such as a word or phrase can easily be selected and separated from the background (the text file), delineating and extracting objects or regions of interest from images, video, or sound files may be a non-trivial task. Words in a text file are clearly delineated by white space or punctuation, but in many cases no such clear distinction exists between features and background in non-textual media. We consider it unrealistic to expect that foolproof automatic segmentation over a wide range of images will become possible in the near future, and so some form of interactive, user guided segmentation will be essential if we are ever to handle objects in images as efficiently and reliably as we handle text.
When selecting text in an application using some form of pointing device such as a mouse, we typically select an initial character and then drag a bounding box around the area in which we are interested. When the selection has been made, the text within the area is usually highlighted in some way to show that it has been selected and is available for further processing. If we were to extend this approach to handle the selection of objects in images as well as of words or paragraphs in text, we would expect to be able to drag an area around the shape or object, and for that object then to be selected. Unfortunately this is where we encounter one of the fundamental differences between selection in text and selection in images. When we are dealing with text, the system already has a high level representation of the object that is being selected: it knows its shape, its boundaries, and its content. With images, no such high level representation is available, and the system has no way of knowing where an object ends and the background begins.
In their paper on Visual Information Retrieval (VIR) [1], Gupta and Jain identify nine different tools that may serve together as a VIR query language. The first of these is an image processing tool that should allow interactive segmentation and modification of images. Such a tool would seek to overcome the problems of selection in images, and allow the user to accurately extract objects for matching and retrieval. A common approach to interactive object identification is flood filling, which is one of the methods used by QBIC [2]. The main problem with this approach is that of 'leaking', where the filled area expands past the boundary of the object, and so interactive pruning and blocking of the area is used to improve results. To overcome this problem of object selection, we have devised a method of allowing a user to extract shapes from images with the minimum of difficulty, using a simple point and click interface. The system is implemented using the Generic Image Processor (GIP), a new image viewer which was purposefully designed to be readily extensible, so as to facilitate the development of image processing tools and to ease integration with existing systems. It incorporates a novel approach to viewer design that enables it to remain extremely small and efficient, yet easily extensible in a wide variety of ways, with the minimum of effort and the minimum of disruption to anyone else using, or developing software with, the system.
GIP is described in more detail in the following section, and in section 3 we present our interactive object delineation technique, based upon assisted splitting and merging of regions, which was developed as an extension to the GIP system. The ultimate goal of this work was to provide a mechanism for the easy extraction of objects from images for use in Content Based Navigation (CBN) and Content Based Retrieval (CBR) in MAVIS [3], the Multimedia Architecture for Video, Image, and Sound; this use is briefly demonstrated in section 4. Finally, in section 5 we summarise the work that has been undertaken so far and outline some areas for future work and potential improvements to the GIP system and our interactive object delineation method.
There are a great number of packages designed for the display and manipulation of images, ranging from relatively simple viewers such as XV to complex image editing packages such as the GIMP and Photoshop. These programs are generally not extensible, and are often horrendously overladen with little used and/or unnecessary features, leading to a situation in which a different program is often used for viewing than for processing. Those large packages that are extensible (GIMP, Photoshop, etc.) typically use a plug-in system in which code may be dynamically loaded into the program. This means that the code has to be written to a specific API, restricting its use to that package and platform, and limiting the functionality of the plug-in to that which is provided by the API. By contrast, our viewer is designed from the outset to be as streamlined and as extensible as possible. It embodies the UNIX programming philosophy that complex systems should be built as a combination of several simpler programs, as this encourages reliability and decreases complexity of design. The system is based on the concept of a radically stripped down core viewer, which by itself provides very little functionality apart from the ability to display an image. It makes no attempt to understand a variety of image formats, and it provides no facilities for image editing or image processing. It is, in short, an image viewer and nothing else. The key to GIP's extensibility comes through its reliance on external processes for all but the most essential of operations, and the easy and flexible way in which it allows these processes to be used to enhance the system. The architecture of the GIP system is described in greater detail in the following section, with section 2.2 showing how basic functionality such as support for multiple file formats and the processing of images is supplied. More complex processing of images, such as our interactive delineation technique, is enabled by the GIP module system, as described in section 2.3.
The one major facility possessed by the core viewer is the ability to run and to communicate with external processes, and these processes are the means by which extra functionality may be added to the base system. For example, when the viewer is asked to load or to save an image in a format that it does not understand, a process is used to convert the image into a format that it does understand. Similarly, if the user wishes to perform some image processing operation on an image, an external process is used, and the result of the operation is displayed in a new window. In total, three types of process are used by GIP to enable it to provide a flexible and comprehensive array of services without becoming prohibitively large or overladen with unnecessary features. They are format filters, image filters, and image modules, each of which is described in more detail in the following sections.
When the viewer is started a configuration file is read which contains a
specification of the viewer menu hierarchy and lists the various filters
and modules that are available for use with the system. By altering this
configuration file, the set of operations available to the user from the
viewer can be changed to best suit the task at hand. Similarly, adding
a new operation to the viewer is simply a case of putting the process in
a place where it can be read by the viewer, and then adding it to the
configuration file. Typically, image processing operations will require
the setting of a number of parameters to be effective. In an
application, the separation of the interface from the processing is good
software design, and a number of systems exist where interfaces are
constructed from text or database information[4,5].
Rather than requiring each process to create its own interface, GIP
provides a method whereby it can create a dialogue from a simple
description in a text file. Processes that require such information can
specify in the viewer configuration file that a dialogue should be
displayed, and how the results from that dialogue should be passed as
arguments to the process. While this method does not allow full access
to the capabilities of the underlying user interface, it greatly
simplifies the process of interface construction, and as the interface
is stored separately from the process, enables existing command line
applications to be easily integrated into the system without any
changes. Of course if a process does require an interface beyond the
capabilities of the system, it is free to create its own.
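As a purely illustrative sketch of this idea (the description format, parameter names, and substitution syntax below are our own inventions, not GIP's actual syntax), the following Python fragment shows how a viewer might turn a simple textual parameter description into the argument list for a filter process:

```python
# Hypothetical parameter description: each line names a parameter and
# gives its type, range, and default, e.g. for a thresholding process.
DESCRIPTION = """\
threshold int  0 255 128
invert    bool -  -   0
"""

def parse_defaults(description):
    """Collect the default value for each parameter in the description."""
    defaults = {}
    for line in description.strip().splitlines():
        name, _type, _lo, _hi, default = line.split()
        defaults[name] = default
    return defaults

def build_command(template, values):
    """Substitute dialogue results into the process's argument template."""
    return template.format(**values).split()

# In the real system the values would come from a dialogue presented by
# the viewer; here we simply use the parsed defaults.
values = parse_defaults(DESCRIPTION)
print(build_command("thresh -t {threshold} -i {invert}", values))
```

Because the dialogue description and the argument template live entirely outside the process, an unmodified command line tool can be driven in this way.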
GIP allows images to be processed using a filtering concept, with two types of filter being used by the system -- format filters, and image filters. Format filters are used to enable GIP to understand a wide range of image formats. When asked to load or save an image in anything other than its native format, the viewer looks for a format filter to handle the conversion. The format filters available for use with the system are specified in the viewer configuration file, and usually consist of two processes, one to convert to GIP's native format, and one to convert from it. When the viewer is asked to load an image in a format that it does not understand, a format filter process is started. This process reads the image and outputs it in the native format, which is read by the viewer and displayed in a window. Similarly, when an image is to be saved into a non-native format, a format filter process is started which reads the image from the viewer and saves it to a file. By using external processes to handle these conversions, the core of the system is kept as small as possible, and may be easily extended to cope with new formats without any recompilation, or even restarting of the viewer.
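To make the format filter idea concrete, the following is a minimal sketch of the 'convert to native' half of such a filter pair, written in Python with the Pillow library and assuming (our assumption; the paper does not specify GIP's native format) that the native format is a simple stream format such as PPM:

```python
# to_native: read any Pillow-supported image on stdin and write it to
# stdout in the assumed native format (PPM). The reverse filter is
# symmetric: read the native format and save in the target format.
import io
import sys

from PIL import Image

data = sys.stdin.buffer.read()             # image in a foreign format
image = Image.open(io.BytesIO(data))
image.convert("RGB").save(sys.stdout.buffer, "PPM")
```

Because the filter communicates only over standard input and output, existing command line converters (such as those in the netpbm suite) could play the same role without modification.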
Image filters are used to provide an image-to-image processing capability, and enable the viewer to process images in a wide variety of ways. Typical image filters might edge detect an image or perform a histogram equalisation, operations which can be useful when some preprocessing of an image is required before a segmentation. Given an image in a viewer window, starting an image-to-image process causes the viewer to start a new filter process and write the image to it. The process manipulates the image in some way, and then outputs a result which is read by the viewer and displayed in a new window. Fig. 3 shows the result of some basic filtering operations on an image -- in this case a Sobel edge detection, followed by edge enhancement.
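As an illustration (this sketch is ours, not GIP's implementation), an image filter process performing a Sobel edge detection like that of Fig. 3 could be written in a few lines of Python with numpy and Pillow, again assuming a PPM-like native format on standard input and output:

```python
# Image filter process: read an image on stdin, apply a Sobel edge
# detection, write the result to stdout. Requires numpy and Pillow.
import io
import sys

import numpy as np
from PIL import Image

grey = np.asarray(
    Image.open(io.BytesIO(sys.stdin.buffer.read())).convert("L"),
    dtype=float)

# Horizontal and vertical Sobel kernels, applied as nine shifted sums.
kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
ky = kx.T
h, w = grey.shape
pad = np.pad(grey, 1, mode="edge")
gx = sum(kx[i, j] * pad[i:i + h, j:j + w] for i in range(3) for j in range(3))
gy = sum(ky[i, j] * pad[i:i + h, j:j + w] for i in range(3) for j in range(3))

# Gradient magnitude, rescaled to 8 bits for display.
magnitude = np.hypot(gx, gy)
scale = magnitude.max() or 1.0
out = (255.0 * magnitude / scale).astype(np.uint8)
Image.fromarray(out).save(sys.stdout.buffer, "PPM")
```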
Image modules are similar in concept to image filters, but are far more powerful as they allow a two way dialogue to take place between the viewer and the module. The viewer informs the module of events instigated by the user such as pointer movements and button presses, while the module is able to request services from the viewer such as displaying menus and dialogues, or overlaying some graphics on top of the displayed image. The user guided object delineation system described in section 3 is implemented as a GIP module. A number of modules may be run in a window, although only the module that is currently active will receive user events and have its output displayed.
Modules are the prime way in which the functionality of the system may be extended as they are effectively able to control an image window by responding to user events, and by updating the display. Several modules have been developed for use with the system that are not described in this paper. These include other interactive object delineation tools, such as a shape extraction module suitable for simple polygonal shapes, active contour modules[6] and statistical snakes[7], and also an MPEG player in which the module generates a stream of images which are displayed through the viewer.
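The paper does not specify the wire protocol between viewer and module, so the following Python skeleton uses an invented line-based protocol purely to illustrate the two way dialogue: the module receives user events on standard input and requests overlay drawing on standard output.

```python
# Skeleton of a module's event loop. The protocol shown ("BUTTON x y"
# in, "OVERLAY_LINE x1 y1 x2 y2" out) is invented for illustration;
# the actual GIP module protocol is not specified in this paper.
import sys

last_click = None
for line in sys.stdin:                  # events sent by the viewer
    fields = line.split()
    if fields and fields[0] == "BUTTON":
        x, y = int(fields[1]), int(fields[2])
        if last_click is not None:
            # Ask the viewer to overlay a line joining successive clicks.
            print("OVERLAY_LINE", *last_click, x, y, flush=True)
        last_click = (x, y)
```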
In an ideal world it would be possible to use a mouse to point and click on an object in an image, and for that object to be correctly segmented and highlighted, in much the same way as it is possible to click on a word in a text document. Unfortunately this is not yet possible, and is not likely to be for some time, except in circumstances where objects are in some way clearly distinct from their background, or where some prior knowledge of the objects to be encountered is available. Our object delineation system, written as a module for the extensible image viewer GIP described in section 2, addresses this problem by providing a number of tools that assist the user in the extraction of objects from images.
The GIP object delineation module consists of a number of tools which may be separated into two categories: those that split an image or regions of an image into smaller regions, and those that take an image consisting of regions and reduce their number by merging. Using only these two fundamental processes it is possible to extract objects from a wide range of images, with a simple, intuitive method of iteratively applying the splitting and merging procedures. This approach differs from the majority of segmentation systems in that the user retains control over the process as it happens, and effectively guides the segmentation until the required result is obtained. It is, of course, unrealistic to expect any automated routine to correctly extract complete object boundaries across all images, as the boundary may simply not exist, and for this reason the final option in the delineation process is to manually edit the results of the segmentation and correct any boundaries that cannot be satisfactorily extracted. In some simple cases the segmentation routines may be able to extract the object correctly without any user interaction, but in the majority of cases at least some editing will be required. In the following sections we describe the methods by which regions may be split and merged, and in section 3.3 we give an example of the system in use.
The delineation process starts with a complete image, which is broken down into a number of regions using a segmentation algorithm. These individual regions may then be split further, or merged together using one of the merging tools. The following sections describe the three segmentation algorithms that are currently available within the system.
Thresholding can be described as the transformation of an input image $f$ to a binary (segmented) image $g$ such that:

\[
g(i,j) = \begin{cases} 1 & \text{for } f(i,j) \geq T \\ 0 & \text{for } f(i,j) < T \end{cases} \tag{1}
\]

where $T$ is the threshold.
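A minimal numpy sketch of Eq. (1), given here purely for illustration:

```python
import numpy as np

def threshold(f: np.ndarray, T: float) -> np.ndarray:
    """Eq. (1): g(i,j) = 1 where f(i,j) >= T, and 0 elsewhere."""
    return (f >= T).astype(np.uint8)
```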
Two types of region growing algorithm may be used in the GIP segmentation module. The first employs a traditional approach in which the user selects a rectangular area of the image to form the start of the region to be grown. All pixels considered similar to those in the selected region are added to the region, and when no more may be added, the remaining pixels are grown into additional regions that do satisfy the criteria. The homogeneity function used by the region growing algorithm is similar to that given in [10]: a pixel $x$ is considered similar to a region $R$, and may therefore be added to it, if the difference between its grey level $f(x)$ and the mean grey level of the region is less than a threshold.
A more useful interactive region growing algorithm is the Seeded Region Growing (SRG) algorithm [11]. Based on conventional region growing techniques, but in many ways bearing more resemblance to a watershed algorithm, it starts with the selection of a number of seed points to which a region growing technique is applied. Each of the seed points is used as the initial member of a region $A_i$, and the algorithm then proceeds by adding one unassigned pixel to a region at each iteration, until there are no more unassigned pixels in the image. This results in a tessellation of the image into the same number of regions as there were seed points.
The algorithm can be more formally explained as follows. If $N(x)$ is the set of immediate neighbours of the pixel $x$, then the set $T$ of as yet unallocated pixels which border at least one of the regions $A_1, \ldots, A_n$ can be written as:

\[
T = \left\{ x \notin \bigcup_{i=1}^{n} A_i \;\middle|\; N(x) \cap \bigcup_{i=1}^{n} A_i \neq \emptyset \right\}
\]

At each iteration a pixel from $T$ is allocated to the adjacent region whose mean grey level is closest to its own.
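The following Python sketch (ours, not the GIP implementation) illustrates the algorithm on a greyscale numpy array; as in [11], at each step the unassigned border pixel whose grey level is closest to the mean of an adjacent region is allocated to that region:

```python
import heapq
import numpy as np

def seeded_region_growing(image, seeds):
    """image: 2-D float array; seeds: list of (row, col) seed points.
    Returns an integer label array tessellating the whole image."""
    labels = np.zeros(image.shape, dtype=int)        # 0 = unassigned
    sums = [0.0] * (len(seeds) + 1)                  # per-region grey sum
    counts = [0] * (len(seeds) + 1)                  # per-region pixel count
    heap = []

    def push_neighbours(r, c, region):
        mean = sums[region] / counts[region]
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < image.shape[0] and 0 <= nc < image.shape[1]
                    and labels[nr, nc] == 0):
                delta = abs(float(image[nr, nc]) - mean)
                heapq.heappush(heap, (delta, nr, nc, region))

    for region, (r, c) in enumerate(seeds, start=1):
        labels[r, c] = region
        sums[region] += float(image[r, c])
        counts[region] += 1
        push_neighbours(r, c, region)

    while heap:
        _, r, c, region = heapq.heappop(heap)
        if labels[r, c]:                             # already allocated
            continue
        labels[r, c] = region
        sums[region] += float(image[r, c])
        counts[region] += 1
        push_neighbours(r, c, region)
    return labels
```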
Regions created using the segmentation routines may be merged together at any time using any of the three techniques described in the following sections. Two of these can be described as automatic in that they may be triggered directly after a region splitting operation has completed without any user intervention, while the third is entirely manual and must be instigated by the user.
Removing regions below a specified size is a simple but very useful process. The threshold and non-seeded region growing segmentations may well result in a few large regions and a large number of very small ones, caused either by noise or by pixels lying on the border of inclusion in more than one region. Removing small regions at an early stage is highly beneficial, as their large numbers can adversely affect the performance of other operations. Small regions are removed by merging them with their closest neighbour, where closeness is determined by examining the difference in mean grey levels between the two regions; the most suitable neighbour is the one with the smallest difference from the region to be removed.
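A sketch of this step over a numpy label image (again ours, for illustration):

```python
import numpy as np

def remove_small_regions(labels, image, min_size):
    """Merge every region smaller than min_size into the adjacent region
    whose mean grey level is closest to its own. labels holds integer
    region ids; image is the corresponding greyscale array."""
    labels = labels.copy()
    merged = True
    while merged:
        merged = False
        ids, sizes = np.unique(labels, return_counts=True)
        means = {int(i): image[labels == i].mean() for i in ids}
        for region, size in zip(ids, sizes):
            if size >= min_size:
                continue
            mask = labels == region
            # Dilate the region by one pixel to find its neighbours.
            border = np.zeros_like(mask)
            border[1:, :] |= mask[:-1, :]
            border[:-1, :] |= mask[1:, :]
            border[:, 1:] |= mask[:, :-1]
            border[:, :-1] |= mask[:, 1:]
            neighbours = set(labels[border & ~mask].tolist()) - {int(region)}
            if neighbours:
                best = min(neighbours,
                           key=lambda n: abs(means[n] - means[int(region)]))
                labels[mask] = best
                merged = True
                break        # statistics are now stale; start a new pass
    return labels
```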
Two heuristics for merging regions based on the edge strength across their boundary have been implemented: the phagocyte heuristic and the weakness heuristic [12]. For two adjacent regions $R_1$ and $R_2$ we consider neighbouring pixels $p_1 \in R_1$ and $p_2 \in R_2$ on either side of the common boundary $B$. The boundary is considered weak at a given point if the grey level difference across it is below a threshold:

\[
|f(p_1) - f(p_2)| < T_1 \tag{6}
\]

Let $W$ be the number of weak points on the common boundary. The phagocyte heuristic merges the two regions if

\[
\frac{W}{\min(l_1, l_2)} > T_2 \tag{7}
\]

where $l_1$ and $l_2$ are the perimeters of $R_1$ and $R_2$, so that a merge is favoured when the weak boundary forms a large fraction of the smaller region's perimeter. The weakness heuristic merges the two regions if the proportion of weak points on the common boundary itself is sufficiently high:

\[
\frac{W}{l} > T_3 \tag{8}
\]

where $l$ is the total length of the common boundary.
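A sketch of the weakness heuristic over a numpy label image (ours; the phagocyte heuristic differs only in normalising $W$ by the smaller region's perimeter rather than by the common boundary length):

```python
import numpy as np

def should_merge_weak(image, labels, r1, r2, t1, t3):
    """Weakness heuristic, Eqs. (6) and (8): merge regions r1 and r2
    when the fraction of weak points on their common boundary exceeds
    t3. A boundary point is weak when the grey level difference across
    it is below t1."""
    weak = total = 0
    # Examine horizontally and vertically adjacent pixel pairs.
    for a, b, fa, fb in (
        (labels[:-1, :], labels[1:, :], image[:-1, :], image[1:, :]),
        (labels[:, :-1], labels[:, 1:], image[:, :-1], image[:, 1:]),
    ):
        pairs = ((a == r1) & (b == r2)) | ((a == r2) & (b == r1))
        total += int(pairs.sum())
        weak += int((pairs & (np.abs(fa.astype(float) - fb) < t1)).sum())
    return total > 0 and weak / total > t3
```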
In this section we give a demonstration of the system being used to extract an aircraft from an image. Figs. 4a to 4c show the six stages of the segmentation process, each of which is described below in greater detail.
As mentioned previously, the purpose of this work was to provide an interactive object delineation capability for use with MAVIS [3]. Using the GIP viewer and segmentation module it is now possible to retrieve documents, and to author and follow links, based upon the characteristics of the extracted object rather than of the entire image. At present such characteristics include the colour distribution, texture, and outline shape of the object, and they may be used together in any combination to provide a range of query options. In Fig. 5 an aircraft has been extracted from an image using the GIP system, and a follow generic link query has taken place based upon the colour distribution of the extracted aircraft. In the image links window on the right, a number of links to related documents can be seen that were returned by MAVIS as having similar colour distributions. The user may select from the available links, and the destination document of the link will be displayed.
In this paper we have presented our method for the interactive delineation of objects in images, using a user guided, point and click, split and merge technique. The system has been implemented as an add-on to our extensible image viewer, and has been demonstrated in use, segmenting an image and acting as a front end to the MAVIS system.
Work continues on the development of both the GIP system and its associated modules. While GIP is easily extensible using external processes, improvements can still be made to the module system by opening up a wider range of services, most importantly communication between modules. Further work on the delineation module is expected to include additional interactive region editing tools, to further assist the extraction process and improve the quality of the segmentations, as well as some improvements to the user interface.
Both authors wish to thank the EPSRC; the first for the support of a research studentship and the second for support through grant GR/L 03446.