P. H. Lewis and M. R. Dobie
Traditionally, databases have been designed to store, retrieve and manipulate textual and numeric data. In recent years, sophisticated tools have been developed for searching, processing and presenting such information effectively in a database context. More recently, substantial effort has been directed to the development of databases for handling non-textual information, in particular, image databases and multimedia databases incorporating text, images, graphics, speech and video sequences. Although the emergence of hardware to support such systems has stimulated their development, including the very recent availability of viable digital video hardware, the software tools for supporting non-textual information lag a long way behind the tools available for text.
The equivalent of a free text search for an image database would be to find all images or video sequences containing a particular object by analysing the digital images themselves to establish if the object is present. This might be possible and straightforward if a high level symbolic description of the image is available, but necessitates substantial processing if only unprocessed raster images are held. Many unsolved problems in image understanding need to be solved before general solutions to such retrieval problems can be conceived.
Hence, we find that retrieval of images from multimedia databases is currently performed mainly through the use of keywords manually associated with the image or image sequence to indicate content. This approach has many problems; for example, images often contain large numbers of objects, and different users may be interested in different items, requiring completely different sets of keyterms. Truly flexible systems will only emerge when one is able to retrieve images or image sequences directly on the basis of content.
Although such flexible and general systems for image and video sequence retrieval are many years away, the combination of image processing and digital video hardware offers the users of multimedia databases the possibility of many useful tools for enhancement, retrieval and analysis of image sequences.
We have recently integrated a video disc player and a colour digital image capture and display system so that video sequences may be selected, digitised and displayed at video frame rates under software control. Start and end frame numbers for video sequences may be stored behind buttons in the windowing environment so that, when selected, the corresponding sequence is played through the image processing system. Chosen processing options may also be selected and a variety of standard tools made available immediately. For example, straightforward geometrical manipulations (often available as hardware functions on the image processing board) have been implemented easily via a user-friendly interface. These include panning, zooming, changing the resolution, rotating the frame and manipulating the colour look-up tables for each of the three colour channels separately to provide a variety of tools for image and video sequence enhancement. A more difficult challenge has been the development of tools which analyse image content. Approaches to the problems of motion isolation and object tracking through image sequences are discussed in the following sections.
In video sequences, the objects of interest are often in motion and an initial aim was to develop a tool for highlighting or isolating the moving objects in a sequence. In other words, we required a tool to replace any cluttered background in the sequence by a homogeneous black or white frame against which the moving objects could be seen clearly. In the general case, where camera motion is permitted and no prior knowledge of the scene is available, the problem is essentially unsolved, but many approaches have been made to motion detection and may be adapted to provide useful tools [8,9,7,6,4]. The general solution will require matching with 3-D models to provide reliable recognition.
Object motion is usually associated with change in the image function and it is instructive to look first at image differencing which has been the starting point for many motion detectors [9,5]. An implicit assumption in much of this work is that the camera remains stationary.
If f(x,y,ti) is the image function at time ti, the difference image is usually defined as:
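The defining equation is not reproduced in this copy of the text. A commonly used form, assumed here rather than taken from the original, is the pixelwise difference between consecutive frames (the symbol d_i is our own label):

```latex
d_i(x,y) = f(x,y,t_i) - f(x,y,t_{i-1})
```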
Three important problems arise with the difference image approach.
The first of these problems is usually overcome by displaying from the image at time ti those pixels for which the absolute difference image values are greater than some significant difference threshold. This is illustrated in figure 1e but it is clear that the second and third problems still remain.
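The thresholded display described above might be sketched as follows. This is a minimal NumPy illustration on greyscale arrays; the function name, the threshold value and the toy frames are assumptions and are not taken from the original system.

```python
import numpy as np

def thresholded_difference(prev_frame, frame, threshold):
    """Show pixels of `frame` whose absolute change from `prev_frame` exceeds `threshold`."""
    # Work in a signed type so that uint8 subtraction cannot wrap around.
    diff = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32))
    mask = diff > threshold
    # Pixels that did not change significantly are replaced by black (0).
    return np.where(mask, frame, 0), mask

# Toy frames (assumed data): a bright 2x2 object moves from columns 0-1 to columns 3-4.
prev_frame = np.zeros((4, 6), dtype=np.uint8)
frame = np.zeros((4, 6), dtype=np.uint8)
prev_frame[1:3, 0:2] = 200
frame[1:3, 3:5] = 200

shown, mask = thresholded_difference(prev_frame, frame, threshold=50)
```

In this toy case the mask flags both the new object position and the area the object has just vacated, so the vacated background is also selected for display.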
In cases where objects move clear of their previous positions from frame to frame, the second of the problems may be overcome by using the following function to isolate the motion. Here we only consider areas for which the difference images with the previous and subsequent frames both have significant values.
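The function itself is not reproduced in this copy. One plausible reading of the description, offered only as a sketch, keeps a pixel of the current frame when its difference with both the previous and the subsequent frame exceeds the threshold. The function name, threshold and toy frames below are assumptions.

```python
import numpy as np

def double_difference(prev_frame, frame, next_frame, threshold):
    """Keep pixels of `frame` that differ significantly from both neighbouring frames."""
    cur = frame.astype(np.int32)
    changed_since_prev = np.abs(cur - prev_frame.astype(np.int32)) > threshold
    changes_before_next = np.abs(next_frame.astype(np.int32) - cur) > threshold
    mask = changed_since_prev & changes_before_next
    return np.where(mask, frame, 0), mask

def toy_frame(left_col):
    """Assumed test data: a bright 2x2 object on a black 4x6 background."""
    frame = np.zeros((4, 6), dtype=np.uint8)
    frame[1:3, left_col:left_col + 2] = 200
    return frame

# The object moves clear of its previous position in every frame.
f0, f1, f2 = toy_frame(0), toy_frame(2), toy_frame(4)
shown, mask = double_difference(f0, f1, f2, threshold=50)
```

With these frames only the object's position in the middle frame survives, because the vacated area differs from the previous frame but not from the next one.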
If the image sequence involves objects in translation which do not move clear of themselves between frames, and we have prior knowledge of the relative intensities of objects and backgrounds, then the third problem may be overcome by performing similar analysis using a signed difference image as a starting point.
In order to overcome the third problem for more general situations, a background accumulation method has been implemented. The assumptions made are that the camera and lighting are stationary, there is a significant difference between background and object pixels and that all moving objects move clear of themselves within n frames, where n is set by the user but typically may be 3 or 4. Pixels which remain relatively constant for n frames or more are then assumed to be background. A first pass through the video sequence is used to accumulate an estimate of the background image, B(x,y), and the motion in subsequent frames is isolated by displaying from each frame those pixels which are significantly different from B(x,y). The algorithm is demonstrated for a test image sequence in figure 2.
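The background accumulation idea can be illustrated with a short NumPy sketch. It is not the authors' implementation: the stability tolerance `tol`, the rule that the first stable run of n frames fixes a background pixel, and all function names are assumptions made for the example.

```python
import numpy as np

def accumulate_background(frames, n=3, tol=10):
    """First pass: estimate B(x,y) from pixels that stay stable for n frames."""
    frames = [f.astype(np.int32) for f in frames]
    background = np.zeros_like(frames[0])
    known = np.zeros(frames[0].shape, dtype=bool)    # pixels with an estimate
    run = np.ones(frames[0].shape, dtype=np.int32)   # length of current stable run
    for prev, cur in zip(frames, frames[1:]):
        stable = np.abs(cur - prev) <= tol
        run = np.where(stable, run + 1, 1)
        settle = (run >= n) & ~known                  # first stable run wins (assumption)
        background[settle] = cur[settle]
        known |= settle
    return background, known

def isolate_motion(frame, background, known, threshold=50):
    """Second pass: keep pixels that differ significantly from the background."""
    mask = known & (np.abs(frame.astype(np.int32) - background) > threshold)
    return np.where(mask, frame, 0), mask

def make_frame(i, height=3, width=10):
    """Assumed test data: a bright 2-pixel-wide bar moving 2 columns per frame."""
    frame = np.full((height, width), 50, dtype=np.uint8)
    frame[:, 2 * i:2 * i + 2] = 255   # empty slice once the bar has left the frame
    return frame

frames = [make_frame(i) for i in range(7)]
background, known = accumulate_background(frames, n=3)
shown, mask = isolate_motion(frames[2], background, known)
```

Unlike the difference-image methods, the whole of the moving bar is recovered here, including any part that would overlap its previous position.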
In order to demonstrate the effect of the basic techniques, the images shown have had no noise reduction processing, although they would clearly benefit from it. Also, the methods do not operate at frame rates on our facilities; they should either be used off line to create a new image sequence which may be viewed at the correct speed, or be reimplemented in hardware.
A modification to this approach would be to update B(x,y) throughout the sequence, whenever a pixel position is found to satisfy the criterion for being a background pixel. Although computationally more expensive, this modified version could be expected to cope reasonably well with occasional changes in lighting or camera position except for a short hiccup while the background is reaccumulated after a change.
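A streaming variant of this kind might look like the sketch below, in which the background estimate is refreshed every time a pixel completes a stable run of n frames. The class name, the tolerance and the use of the first frame as a provisional background are assumptions for illustration only.

```python
import numpy as np

class RunningBackground:
    """Background estimate that is refreshed throughout the sequence."""

    def __init__(self, first_frame, n=3, tol=10):
        self.n, self.tol = n, tol
        self.prev = first_frame.astype(np.int32)
        self.run = np.ones(first_frame.shape, dtype=np.int32)
        self.background = self.prev.copy()   # provisional until pixels settle
        self.known = np.zeros(first_frame.shape, dtype=bool)

    def update(self, frame):
        cur = frame.astype(np.int32)
        stable = np.abs(cur - self.prev) <= self.tol
        self.run = np.where(stable, self.run + 1, 1)
        settled = self.run >= self.n
        # Refresh every settled pixel, not only those seen for the first time.
        self.background[settled] = cur[settled]
        self.known |= settled
        self.prev = cur
        return self.background

# Assumed test data: a uniform scene whose lighting jumps from 50 to 90.
dim = np.full((2, 2), 50, dtype=np.uint8)
bright = np.full((2, 2), 90, dtype=np.uint8)

model = RunningBackground(dim, n=3)
for _ in range(3):
    model.update(dim)
before_change = model.background.copy()
model.update(bright)
model.update(bright)
during_change = model.background.copy()
model.update(bright)
after_change = model.background.copy()
```

The estimate keeps its old value for a few frames after the lighting change and then follows the new level, which corresponds to the short interruption described above.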
The ability to identify a specific object in images or image sequences and/or to track an object through a sequence would provide the basis for many useful tools for image sequence handling in a multimedia database context. Several levels of analytical tool might be envisaged which would put increasing demands on the image interpretation software required. For example,
We have started to implement versions of the first of these tools to provide tracking and highlighting of a specific object through a video sequence. Several operations and assumptions must be made in order to track an object given an indication of its location in the first frame. If one can assume that the object to be tracked is rigid, that the motion is in or near the image plane and that the relative translational motions between the object, camera and background are small with respect to the frame speed, the appearance of the object in each frame may be sufficiently similar for the correspondence problem to be solved by matching a template of the object in a search region surrounding the object location found in the previous frame.
In figure 3a the first, sixth, eleventh and sixteenth frames from a short video sequence of spirillum bacteria are shown. The object to be tracked was identified in the first frame using a mouse to draw around and highlight the general area of interest. This area was used directly to define the object template and a search area in the second frame was defined by expanding the template area a fixed amount in all directions. This process is equivalent to assuming a maximum image plane velocity for the object.
The template representation was based on a thresholded edge strength map in the specified area, derived from the average intensity of the individual RGB intensity values. The best match was selected as the one which minimised the sum of the absolute differences between the template and trial area values. In subsequent frames the search area was defined by expanding the best match area in the corresponding previous frame. After matching in each frame, the best match area was highlighted to display the tracking.
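A single tracking step of this kind can be sketched in NumPy as follows. The edge operator used for the template is not specified in this section, so a simple central-difference gradient (`np.gradient`) is assumed; the function names, threshold, margin and toy frames are likewise illustrative.

```python
import numpy as np

def edge_map(rgb, threshold):
    """Thresholded edge strength of the average of the R, G and B values."""
    gray = rgb.astype(np.float64).mean(axis=2)
    gy, gx = np.gradient(gray)            # assumed edge operator
    return (np.hypot(gx, gy) > threshold).astype(np.int32)

def track_step(template, edges, top, left, margin):
    """Return (sad, row, col) of the best match within `margin` of (top, left)."""
    h, w = template.shape
    best = None
    for r in range(max(0, top - margin), min(edges.shape[0] - h, top + margin) + 1):
        for c in range(max(0, left - margin), min(edges.shape[1] - w, left + margin) + 1):
            # Sum of absolute differences between template and trial area.
            sad = int(np.abs(edges[r:r + h, c:c + w] - template).sum())
            if best is None or sad < best[0]:
                best = (sad, r, c)
    return best

def toy_rgb(top, left):
    """Assumed test data: a bright 4x4 square on a black 20x20 RGB frame."""
    frame = np.zeros((20, 20, 3), dtype=np.uint8)
    frame[top:top + 4, left:left + 4, :] = 200
    return frame

edges_first = edge_map(toy_rgb(5, 5), threshold=50)
edges_second = edge_map(toy_rgb(7, 8), threshold=50)

# The user-highlighted area in the first frame: an 8x8 box with corner (3, 3).
template = edges_first[3:11, 3:11]
best = track_step(template, edges_second, top=3, left=3, margin=4)
```

The margin plays the role of the maximum image plane velocity: the object is assumed not to move more than `margin` pixels in either direction between frames.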
The search area could be reduced by making an assumption of constant image plane velocity to predict a focus of attention point about which a smaller search area could be created.
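As a small illustration, a constant-velocity prediction simply extrapolates the last displacement. The function name and the (row, column) tuple convention are assumptions.

```python
def predict_focus(prev_pos, cur_pos):
    """Constant-velocity prediction: next = current + (current - previous)."""
    return (2 * cur_pos[0] - prev_pos[0], 2 * cur_pos[1] - prev_pos[1])

# Best-match positions found in the two most recent frames (assumed values).
focus = predict_focus((3, 3), (5, 6))
```

A smaller search area can then be centred on `focus` instead of on the previous best match.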
It can be seen from figure 3 that the chosen object has been tracked successfully through the sixteen frames in spite of the close encounter with another object and a small amount of rotation in the image plane. In these examples all forms of rotation are assumed to be absent. Rotation in the image plane may be accommodated fully in a straightforward but computationally expensive manner by rotating the template and recalculating rectangular coordinates, but rotation out of the image plane and other apparent changes in shape are, of course, more difficult to accommodate. If the change in appearance is small from frame to frame, it would seem possible to use the best match from each frame as the template for the next, but initial experiments with this approach suggest that it is easily misled.
The template method is not totally robust, even in the absence of shape changes in the image plane. In figure 3b an attempt is made to track the funnel of an engine through a short video sequence. Although the funnel was successfully tracked past the first trees, the tracking is thwarted when the background and object become almost indistinguishable. This problem may possibly be overcome by using more stringent motion constraints and optimising the matching over several frames. These approaches are currently being investigated.
As an alternative to template matching we are also investigating a method based on the generalised Hough transform, which can handle translation, rotation and size variations in the image plane and is more robust than template matching in the presence of occlusion. Briefly, the approach involves using the manually highlighted area in the first frame to generate a Hough R-table (see below for details) representing the object to be tracked, and for each subsequent frame a search area is identified and a Hough accumulator generated. The highest peak in the accumulator gives the location, rotation and size for the object in the frame. Tracking, highlighting and the use of motion constraints can then proceed as before.
An example of the use of the Hough method is given in figure .
In the next section we give more details of the generalised Hough transform approach and indicate how motion constraints may be used to restrict the size of the Hough accumulator.
In the first frame of the video sequence, the manually highlighted area is processed to give an edge map using 3x3 Sobel edge operators. The edge map is thresholded on edge strength, giving a set of strong edges to act as a fingerprint for the required object. For each significant edge point, the vector r from the edge position to a reference position in the highlighted area is calculated. A linked list of edge orientation and r value pairs is then formed to act as the Hough R-table.
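The construction of the R-table can be sketched as below. This is an illustration under stated assumptions: the text describes a linked list of orientation and r pairs, whereas the sketch groups the r vectors in a dictionary keyed by quantised orientation, and the number of orientation bins, the threshold and the function names are our own choices.

```python
import numpy as np
from collections import defaultdict

def sobel(gray):
    """3x3 Sobel gradients (gx, gy); border pixels are left at zero."""
    g = gray.astype(np.float64)
    gx = np.zeros_like(g)
    gy = np.zeros_like(g)
    gx[1:-1, 1:-1] = ((g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:])
                      - (g[:-2, :-2] + 2 * g[1:-1, :-2] + g[2:, :-2]))
    gy[1:-1, 1:-1] = ((g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:])
                      - (g[:-2, :-2] + 2 * g[:-2, 1:-1] + g[:-2, 2:]))
    return gx, gy

def strong_edges(gray, threshold):
    """Positions and orientations of edge points above the strength threshold."""
    gx, gy = sobel(gray)
    rows, cols = np.nonzero(np.hypot(gx, gy) > threshold)
    angles = np.arctan2(gy[rows, cols], gx[rows, cols])
    return rows, cols, angles

def build_r_table(gray, reference, threshold, n_bins=36):
    """Map each quantised edge orientation to the vectors r = reference - edge."""
    table = defaultdict(list)
    rows, cols, angles = strong_edges(gray, threshold)
    bins = np.floor((angles + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    for r, c, b in zip(rows, cols, bins):
        table[int(b)].append((reference[0] - int(r), reference[1] - int(c)))
    return table

# Assumed test data: a bright square standing in for the highlighted area.
patch = np.zeros((12, 12), dtype=np.uint8)
patch[4:8, 4:8] = 200
r_table = build_r_table(patch, reference=(6, 6), threshold=100)
edge_rows, edge_cols, edge_angles = strong_edges(patch, threshold=100)
```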
For subsequent frames, a search space is defined around a focus of attention position and an accumulator is created with dimensions for the x and y coordinates of the reference position, the range of possible rotations and the range of possible scaling factors. The edge orientations for strong edges are estimated as for the first frame. For each strong edge point in the search area, a window is placed around its edge orientation value and the r values for all edge orientations in the window are obtained from the R-table and used to increment the accumulator. The accumulator cell with the highest count then gives the location, rotation and scale of the object in the frame. The number of hits in the accumulator peak also gives an indication of the quality of the match. The maximum velocity constraint limits the search space and hence the size of the accumulator array along the reference position dimensions. Constraints on rotation and motion out of the image plane may also restrict the other dimensions of the accumulator, which is a useful consideration as the potentially large size of the accumulator is one of the disadvantages of the approach.
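The voting step can be illustrated with a small, self-contained sketch. It assumes the R-table is a list of (orientation, r) pairs with orientations in radians, treats rotation as acting in (x, y) = (column, row) coordinates, and uses a hand-made toy model in place of real edge data. The function name, window width and candidate rotations and scales are all assumptions.

```python
import numpy as np

def hough_vote(edge_points, r_table, shape, rotations, scales, window=0.1):
    """Accumulate votes over (row, col, rotation index, scale index)."""
    acc = np.zeros((shape[0], shape[1], len(rotations), len(scales)), dtype=np.int32)
    for row, col, theta in edge_points:
        for i, phi in enumerate(rotations):
            cos_p, sin_p = np.cos(phi), np.sin(phi)
            for angle, (dr, dc) in r_table:
                # Orientation difference after undoing rotation phi, wrapped to [-pi, pi).
                diff = (theta - phi - angle + np.pi) % (2 * np.pi) - np.pi
                if abs(diff) > window:
                    continue
                # Rotate r by phi, then scale it and vote for the reference position.
                rot_dc = dc * cos_p - dr * sin_p
                rot_dr = dc * sin_p + dr * cos_p
                for j, s in enumerate(scales):
                    r = int(round(row + s * rot_dr))
                    c = int(round(col + s * rot_dc))
                    if 0 <= r < shape[0] and 0 <= c < shape[1]:
                        acc[r, c, i, j] += 1
    return acc

# Toy model (assumed): edge points as (row offset, col offset, orientation)
# relative to the reference point, so that r = -(offset).
model = [(0, 4, 0.0), (3, 0, 1.0), (-2, -1, 2.0)]
r_table = [(theta, (-pr, -pc)) for pr, pc, theta in model]

# Toy scene: the model rotated by 90 degrees about a reference at (10, 12).
phi_true, ref = np.pi / 2, (10, 12)
scene = [(ref[0] + pc * np.sin(phi_true) + pr * np.cos(phi_true),
          ref[1] + pc * np.cos(phi_true) - pr * np.sin(phi_true),
          theta + phi_true) for pr, pc, theta in model]

rotations, scales = [0.0, np.pi / 2], [1.0, 2.0]
acc = hough_vote(scene, r_table, (24, 24), rotations, scales)
peak = tuple(int(v) for v in np.unravel_index(acc.argmax(), acc.shape))
```

Restricting `shape`, `rotations` and `scales` is how the motion constraints reduce the size of the accumulator in this sketch.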
The Hough approach was found to be considerably faster than the template matching approach and some simple calculations indicate that, providing the search space is relatively large compared to the object size, one would expect this to be the case.
The translation was initiated by Mark Dobie on 1999-03-04