High Speed Low Complexity Guided Image Filtering Based Disparity Estimation
Charan Kumar Vala, Koushik Immadisetty, Amit Acharyya, Member, IEEE, Charles Leech, Vibishna Balagopal, Geoff V. Merrett, Member, IEEE, and Bashir M. Al-Hashimi, Fellow, IEEE

Abstract—Stereo vision is a methodology to obtain depth in a scene based on a stereo image pair. In this paper we introduce a Discrete Wavelet Transform (DWT) based methodology for a state-of-the-art disparity estimation algorithm, which results in a significant performance improvement in terms of speed and computational complexity. In the initial stage of the proposed algorithm, we apply the DWT to the input images, reducing the number of samples to be processed in subsequent stages by 50%, thereby decreasing computational complexity and improving processing speed. An architecture has then been designed based on this methodology and prototyped on a Xilinx Virtex-7 FPGA. The performance of the proposed methodology has been evaluated on four standard Middlebury benchmark image pairs, viz. Tsukuba, Venus, Teddy and Cones. The proposed methodology results in an improvement of about 44.4% in cycles per frame, 52% in frames per second, and 61.5% and 59.6% in LUT and register utilization respectively, compared with state-of-the-art designs.

Index Terms—Guided Image Filtering, Stereo-Matching, FPGA, Discrete Wavelet Transform, Low-Complexity, High-Speed.

I. INTRODUCTION

STEREO matching, the task of matching images acquired by a pair of cameras (conventionally called reference and target images) and calculating the depth of objects in a scene, is being employed in sophisticated embedded vision applications including surveillance, autonomous vehicles and mobile robots [1]. A geometrical representation of the same is shown in Fig. 1. Since such systems are mostly used in mobile environments running from battery or harvested energy, the primary design-challenges are low power consumption and area-overhead reduction, whilst meeting the real-time processing requirement with acceptable precision. Hence the choice of matching algorithm and respective architecture implementation play an important role to tackle these challenges and meet these constraints for producing a viable embedded stereo matching system.

Fig. 1. Geometry of a stereo vision system where a pair of parallel cameras capture the scene from horizontally disparate viewpoints. Given that the input images are rectified, the correspondence of a pixel at coordinate (x,y) of the reference image can only be found at the same vertical coordinate y within a disparity range ($D_{min}$ to $D_{max}$) in the target image.

Thorough analysis of the existing literature shows that, despite having a low error rate in the disparity computation, global stereo matching algorithms cannot be supported by state-of-the-art DSPs due to their intensive computational needs [2], [3]. A viable alternative could be GPU-based implementations, but at the expense of high cost and power consumption for real-time designs [4]. On the other hand, due to their reduced algorithmic complexity, local stereo matching algorithms have been implemented on FPGA and ASIC platforms to meet real-time requirements with higher frames per second than their global matching counterparts [5]–[7]. However, the concern with local matching algorithms is their low precision. To mitigate this issue, investigations have been made to implement dedicated hardware architectures of more precise algorithms, such as Semi Global Matching (SGM) [8], [9] and Adaptive Support Weight (ADSW) [10], [11]. For the past few years, hardware implementations based on SGM and ADSW algorithms have become the preferred solution towards higher matching precision in embedded vision applications [5], [7], [12], [13]. In addition, modifications and improvements have been made to implementations to adapt the algorithms for real-time processing, with improvement in execution times over existing designs [14] but at the cost of increased error rates when compared to state-of-the-art software implementations [15]. Furthermore, high memory and hardware resource requirements are the bottleneck for scaling these designs to higher resolution images. We believe that efficiently reducing the number of samples to be processed at different stages of the algorithm will reduce the consumption of hardware resources, improving the scalability of these designs. With this motivation, we introduce the Discrete Wavelet Transform (DWT) and the Inverse Discrete Wavelet Transform (IDWT) at the pre- and post-processing stages of the traditional Guided Image Filter (GIF) based disparity estimation process flow, which results in reduced processing-data sizes in the intermediate stages and thus less complex hardware in the FPGA implementation and improved throughput. It is to be noted that the DWT has been exploited in state-of-the-art disparity estimation for extracting the feature points of the image in [16] and for improving the quality of the disparity map in [17], [18]. To the best of our knowledge, it has never been attempted in GIF based disparity estimation for hardware complexity reduction.

The GIF algorithm has been shown to reduce the complexity of the cost aggregation step in local ADSW algorithms [19], [20]. Recently published literature detailed the fully pipelined and parallel GIF architecture and corresponding FPGA based implementations [21], [22]. However, instead of designing a stand-alone architecture, we believe holistic optimizations of algorithm and architecture would lead to improved performance. With this motivation, in this paper we:

- Introduce a Discrete Wavelet Transform (DWT) based methodology for the disparity estimation algorithm.
- Design the respective architecture and prototype it on a Xilinx Virtex-7 FPGA and validate it against the standard databases, comparing the performance to the outcomes of existing designs.
- Show that the proposed methodology results in an improvement of about 44.4% in Cycles per Frame (CPF), 52% in Frames per Second (FPS), and 61.5% and 59.6% in LUT and register utilization respectively, compared with the state-of-the-art designs [9], [13], [23].

The rest of the paper is organized as follows: Section II provides the necessary theoretical background, Section III introduces the proposed methodology and subsequent architecture, Section IV discusses the experimental results, compares performance of the proposed methodology with existing designs and finally Section V concludes the discussion. The acronyms used throughout the paper are explained in Table I.

II. RELATED WORKS

Stereo matching algorithms can broadly be classified into two categories: global and local [24]. Global algorithms are formulated as an energy minimization problem, which is solved with techniques such as Dynamic Programming, Graph Cuts and Belief Propagation. Such methods produce very accurate results at the expense of high computational complexity and memory needs. Semi-Global Matching (SGM) [8], [9], [15] methods renounce part of the accuracy by approximating a global 2D function using a sum of 1D optimizations from all directions through the image. SGM methods are therefore more affordable for dedicated hardware implementations, but still consume significant memory to store the interim cost of different aggregation paths. In contrast, local algorithms use block matching and winner-takes-all optimization to determine the disparity associated with a minimum cost function at each pixel [3]. Hence, they have lower computational complexity and memory requirements compared to global and SGM methods.

Among local algorithms, the most recent Adaptive Support Weight (ADSW) methods are currently the most accurate [2]. Despite their good results, ADSW algorithms cannot take advantage of the integral image or sliding window techniques, as the adaptive weights have to be recomputed at every pixel. This makes the hardware complexity of the cost aggregation directly dependent on the support window size. To improve matching accuracy, a few attempts have been made by combining different stereo algorithms together or by implementing modified versions of SGM and ADSW algorithms. A modified version of the census transform in both the intensity and gradient images, in combination with the SAD correlation metric, has been implemented in hardware [6]. A stereo algorithm based on the neural network and Disparity Space Image (DSI) data structure is introduced in [7] and implemented on an FPGA. Zhang et al. combined both the mini-census transform and cross based cost aggregation for implementing real-time FPGA based stereo matching [25]. SGM-based stereo matching systems have been introduced in [8], [9], [15] and implemented on FPGAs and hybrid FPGA/RISC architecture based platforms respectively. The VLSI design of an ADSW algorithm that adopted the mini-Census transform was implemented to improve the accuracy and robustness of the system to radiometric distortions [14]. Incorporating an ADSW algorithm and integration of pre- and post-processing units, [11] proposed the implementation of a complete stereo vision system. Finally, a hardware oriented stereo matching system based on the adaptive Census transform is presented in [10].

The aforementioned high-quality ADSW-based systems follow a homogeneous algorithm-to-hardware mapping methodology. The recently proposed Guided Image Filter (GIF) [19] has been employed in [20]–[22] to reduce the complexity of the cost aggregation step in ADSW methods, leading to a high-quality, fast and simple algorithm consisting of the following steps: cost volume construction, cost volume filtering, disparity selection and disparity refinement. Cost Volume Construction (CVC) computes a matching cost between two pixels, one from the left image and one from the right image, to identify the disparity corresponding to the depth of a point in the scene [3]. The information obtained by matching single pixels is not sufficient for precise matching, so cost volume filtering is used in the next step to minimize the matching uncertainties. Following filtering, the most probable disparity is selected by using a local winner-takes-all (WTA) strategy. Finally, a refinement step is used to reduce noise and improve the disparity map [3]. Recently, Ttofis et al. [21], [22] proposed a fully pipelined, parallel stereo matching FPGA-based hardware architecture based on the GIF, achieving real-time processing for high definition (HD) images.

III. PROPOSED METHODOLOGY AND ARCHITECTURE

A. Proposed Methodology

1) Overview: Fig. 2 shows the entire flow of the DWT based stereo vision algorithm, comprised of six stages. The proposed DWT is the first stage that receives the rectified stereo images captured from a stereo camera. The DWT transforms these images into approximate wavelet coefficients and feeds these coefficients into the second stage of CVC, which includes the Sum of Absolute Differences (SAD) and Gradient (GRD) stages. The GIF is used to filter each cost volume slice. In the fourth stage, the minimum among all cost volumes is selected by the WTA strategy for disparity selection. The first instance of the actual disparity (equivalent to the size of original image) is synthesized from the wavelet-domain using IDWT as part of the fifth stage, followed by post-processing in the sixth stage which includes left and right consistency check and median filtering.

2) Proposed DWT based methodology: In the context of the stereo-vision algorithm, we propose to use a DWT as the first stage to reduce the number of samples required for subsequent computations [26]–[30]. Exploiting the property of stereoscopic images that neighboring pixels are highly correlated, and without any loss of generality, only the first resolution level of approximate wavelet coefficients is considered for further computation, while the detailed wavelet coefficients are discarded, which amounts to a simple down-sampling. This is evident in Fig. 2, where the original image, approximate wavelet coefficients and detailed wavelet coefficients are shown. When the approximate and detailed coefficients are compared, the approximate wavelet coefficients retain most of the features of the original image. Therefore, without sacrificing matching accuracy, the overall number of samples for subsequent computations can be reduced to half that of the original image. The Haar wavelet is chosen as the mother wavelet because it can be implemented using a simple adder and shifter, leading to a low-complexity hardware implementation. The approximate wavelet coefficients of the input images from the first resolution level, after applying a vertical 1D Haar mask [0.5, 0.5], can be represented as:

$$I_{left}^i(j) = \left( L^i(2j+1) + L^i(2j) \right)/2 \tag{1}$$

and

$$I_{right}^i(j) = \left( R^i(2j+1) + R^i(2j) \right)/2 \tag{2}$$

where $I_{left}^i$ and $I_{right}^i$ are the approximate wavelet coefficients of the input stereo images (represented as $L$ and $R$ respectively), $i$ denotes the color channel in RGB space and $j = \lfloor Hei/2 \rfloor \mod Wid$.

At this stage the number of pixels of the input images to be processed is halved, i.e., from $(Hei \times Wid)$ to $(Hei \times Wid)/2$, thereby significantly reducing the computational complexity of subsequent modules.
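The row-pair averaging described above can be illustrated with a minimal NumPy sketch. The function name `haar_approx_rows` and the handling of an odd number of rows (the last row is dropped) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def haar_approx_rows(img):
    """First-level Haar approximation along the vertical axis.

    img: array of shape (Hei, Wid) or (Hei, Wid, 3). Adjacent row pairs
    are averaged, so the output has Hei // 2 rows.
    """
    img = np.asarray(img, dtype=np.float64)
    rows = img.shape[0] - (img.shape[0] % 2)  # drop a trailing odd row (assumption)
    return (img[0:rows:2] + img[1:rows:2]) / 2.0

# Small 4 x 6 example: the result has half as many rows.
left = np.arange(24, dtype=np.float64).reshape(4, 6)
approx = haar_approx_rows(left)
print(approx.shape)
```

In hardware the division by two would be a right shift, which is why a single adder and shifter suffice per channel.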

3) Processing of the Approximate Wavelet Coefficients using GIF Based stereo matching [20]: CVC involves the cost computation of each pixel between the stereo images over a range of disparities $d$ by considering the truncated absolute difference of colors and gradients as follows:

$$C(p, d) = a \cdot \min(T_c, M(p, d)) + (1 - a) \cdot \min(T_g, G(p, d)) \tag{3}$$

where

$$M(p, d) = \sum_{i=1}^{3} \left| I_{left}^i(p) - I_{right}^i(p - d) \right| \tag{4}$$

and

$$G(p, d) = \left| \nabla_x I_{left}^i(p) - \nabla_x I_{right}^i(p - d) \right|, \tag{5}$$

where $i$ denotes the color channel in RGB space and $\nabla_x I_{left}^i$ denotes the gradient in the $x$ direction computed at pixel $p$. The parameter $a$ balances the influence of the color and gradient terms, and $T_c$ and $T_g$ are the truncation thresholds of the color and gradient terms respectively.
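As a hedged illustration of Equations (3) to (5), the sketch below computes one cost volume slice with NumPy. Several choices here are assumptions rather than details from the paper: the gradient is taken on the channel mean (a grayscale proxy) using `np.gradient`, the right image is shifted with edge replication, and columns without a valid match are assigned the maximum truncated cost.

```python
import numpy as np

def cost_slice(left, right, d, a, t_c, t_g):
    """Matching cost C(p, d) for a single disparity d.

    left, right: float arrays of shape (H, W, 3).
    """
    h, w, _ = left.shape
    # right(p - d): shift the right image d columns to the right (edge padded).
    shifted = np.pad(right, ((0, 0), (d, 0), (0, 0)), mode="edge")[:, :w]

    m = np.abs(left - shifted).sum(axis=2)               # colour term M(p, d)
    grad_l = np.gradient(left.mean(axis=2), axis=1)      # x-gradient on gray (assumption)
    grad_s = np.gradient(shifted.mean(axis=2), axis=1)
    g = np.abs(grad_l - grad_s)                          # gradient term G(p, d)

    cost = a * np.minimum(t_c, m) + (1.0 - a) * np.minimum(t_g, g)
    cost[:, :d] = a * t_c + (1.0 - a) * t_g              # no valid match: maximum cost
    return cost
```

Stacking `cost_slice` over all $d$ in the disparity range gives the cost volume that is filtered in the next step.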

Next, the GIF is used to filter the cost volume. The filtered cost of a pixel $p$ at disparity $d$ is given by:

$$q(p, d) = \sum W_{i,j}(p)\, C(p, d) \tag{6}$$

where

$$W_{i,j} = \frac{1}{|w|^2} \sum_{k:(i,j)\in w_k} \left( 1 + \frac{(I_i - \mu_k)(I_j - \mu_k)}{\sigma_k^2 + \varepsilon} \right) \tag{7}$$

In (6) and (7), $i$ and $j$ are pixel indices, $W_{i,j}$ is the weight that pixel $j$ contributes to pixel $i$, $w_k$ is a window of radius $r$ centered at pixel $k$ (the sum in (7) runs over all windows that contain both $i$ and $j$), $|w|$ is the number of pixels in $w_k$ (i.e., $(2r+1) \times (2r+1)$), and $\mu_k$ and $\sigma_k$ are the mean and the standard deviation of $I$ (the first resolution level of wavelet approximate coefficients of the guidance image) in $w_k$. The weights $W_{i,j}$ can be computed using linear operations [19] which can be decomposed into a series of mean filters with windows of radius $r$. The pseudocode of the GIF is given in Algorithm 1, where $f_{mean}$ is a boxfilter (same as a mean filter) with a window of radius $r$. The abbreviations of correlation ($corr$), variance ($var$), and covariance ($cov$) hold their usual meaning.

Disparity selection, involving the condensation of the cost volume back into a single image, is performed through the WTA strategy where $d \in [D_{min}, D_{max}]$:

$$d_p' = \arg\min_{d}\, q(p, d) \tag{8}$$
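The WTA selection in (8) reduces to an arg-min along the disparity axis of the filtered cost volume. The sketch below assumes the volume is stored as a NumPy array of shape (D, H, W); the function name and the storage layout are illustrative choices, not the paper's.

```python
import numpy as np

def winner_takes_all(filtered_cost, d_min=0):
    """Pick, per pixel, the disparity with the lowest filtered cost.

    filtered_cost: array of shape (D, H, W) holding q(p, d) for
    d = d_min, ..., d_min + D - 1.
    """
    return np.argmin(filtered_cost, axis=0) + d_min

# Two pixels, three candidate disparities.
volume = np.array([[[0.9, 0.2]],
                   [[0.1, 0.5]],
                   [[0.4, 0.7]]])
disparity = winner_takes_all(volume)
```

In the hardware described later, the same selection is realized with a comparator tree instead of a sequential scan.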

4) Proposed IDWT based disparity estimation: The first disparity map, $d_p$, which has the size of the original image, is synthesized from $d_p'$ by applying the IDWT as follows:

$$d_p(2k) = \left( d_p'(k) + d_p'(k + 1) \right)/2 \tag{9a}$$

$$d_p(2k - 1) = d_p'(k) \tag{9b}$$
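A minimal sketch of the up-sampling in (9a) and (9b) is given below, using 0-based NumPy indexing in place of the 1-based $k$. The treatment of the last row, where $d_p'(k+1)$ does not exist, is an assumption: the last row is simply replicated.

```python
import numpy as np

def upsample_rows(d_half):
    """Double the number of rows of a wavelet-domain disparity map."""
    d_half = np.asarray(d_half, dtype=np.float64)
    # d'(k + 1), with the final row replicated at the border (assumption).
    nxt = np.concatenate([d_half[1:], d_half[-1:]], axis=0)
    out = np.empty((2 * d_half.shape[0],) + d_half.shape[1:])
    out[0::2] = d_half                  # (9b): rows 2k - 1 copy d'(k)
    out[1::2] = (d_half + nxt) / 2.0    # (9a): rows 2k average d'(k) and d'(k + 1)
    return out

full = upsample_rows([[2.0], [4.0], [8.0]])
```

The averaging again needs only an adder and a one-bit right shift, which mirrors the forward Haar stage.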

Fig. 3. Computed disparity in the wavelet domain (left) and the up-sampled disparity after IDWT (right).

**Algorithm 1** Guided Image Filter [20]
1: INPUT: guidance image $I$, guided image $p$
2: $mean_I = f_{mean}(I)$, $mean_p = f_{mean}(p)$
3: $corr_I = f_{mean}(I \ast I)$
4: $corr_{Ip} = f_{mean}(I \ast p)$
5: $var_I = corr_I - mean_I \ast mean_I$
6: $cov_{Ip} = corr_{Ip} - mean_I \ast mean_p$
7: $a = cov_{Ip} / (var_I + \varepsilon)$, $b = mean_p - a \ast mean_I$
8: $mean_a = f_{mean}(a)$, $mean_b = f_{mean}(b)$
9: $q = mean_a \ast I + mean_b$
10: OUTPUT: $q$ (filtered image)
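The steps of the guided filter can be sketched in NumPy as follows, for a single-channel guidance image. This is an illustrative software version, not the hardware GIFU: the box mean uses an integral image with edge replication at the borders, which is an assumption, and the names `box_mean` and `guided_filter` are chosen here for readability.

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window, with edge replication at borders."""
    k = 2 * r + 1
    padded = np.pad(x, r, mode="edge")
    # Integral image with a leading zero row and column.
    s = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    s[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    total = s[k:, k:] - s[:-k, k:] - s[k:, :-k] + s[:-k, :-k]
    return total / (k * k)

def guided_filter(I, p, r, eps):
    """Filter p (one cost slice) under the guidance of image I."""
    mean_I, mean_p = box_mean(I, r), box_mean(p, r)
    corr_I, corr_Ip = box_mean(I * I, r), box_mean(I * p, r)
    var_I = corr_I - mean_I * mean_I
    cov_Ip = corr_Ip - mean_I * mean_p
    a = cov_Ip / (var_I + eps)          # local linear coefficients
    b = mean_p - a * mean_I
    return box_mean(a, r) * I + box_mean(b, r)
```

Every stage is a box mean or a pointwise operation, which is what makes the filter attractive for a pipelined implementation.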

**Algorithm 2** Proposed Disparity Estimation Algorithm
1: INPUT: pair of stereo images $L$ (left), $R$ (right)
2: for $k = R,G,B$ do
3: for $i = 1$ to $Hei/2$ do
4: for $j = 1$ to $Wid$ do
5: $I_{left}(i,j,k) = (L(2i-1,j,k)+L(2i,j,k))/2$
6: $I_{right}(i,j,k) = (R(2i-1,j,k)+R(2i,j,k))/2$
7: end for
8: end for
9: end for
10: Cost Volume Construction:
11: for $Disp = D_{min}$ to $D_{max}$ do
12: for $i = 1$ to $Hei/2$ do
13: for $j = 1$ to $Wid$ do
14: Find the SAD
15: Find the gradient
16: Find the cost
17: end for
18: end for
19: end for
20: Cost Volume Filtering:
21: for $CVF = D_{min}$ to $D_{max}$ do
22: Repeat GIF Algorithm 1
23: end for
24: Disparity Selection:
25: for $CVF = D_{min}$ to $D_{max}$ do
26: $d_p' = \arg\min_d q(p, d)$
27: end for
28: Up-sampling of disparity:
29: for $i = 1$ to $Hei/2$ do
30: for $j = 1$ to $Wid$ do
31: $d_p(2i,j) = (d_p'(i,j) + d_p'(i+1,j))/2$
32: $d_p(2i-1,j) = d_p'(i,j)$
33: end for
34: end for
35: Post Processing:
36: Left-right consistency check and filling
37: Disparity refinement with a median filter
38: OUTPUT: Final Disparity

In (9a) and (9b), $k = 1, \dots, Hei/2$ for every column of the image ($\forall\, Wid$). A demonstration of the up-sampling strategy for a 16-pixel template is shown in Fig. 2. To reduce noise and improve the quality of the disparity map, a post-processing step is performed. This involves a left-right (L-R) consistency check, which requires the computation of both the left and right disparity maps. A pixel in the left disparity map is marked as inconsistent if the disparity value of its matching pixel in the right disparity map differs by more than one pixel. Inconsistent pixels are then filled with the disparity of the closest consistent pixel [20], and a median filter is used to smooth the filled regions and remove spikes. The pseudocode of the proposed DWT based disparity estimation methodology is given in Algorithm 2.
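The L-R consistency check and filling step can be sketched as below. The sketch assumes integer disparity maps, searches for the closest consistent pixel along the same row only, and treats pixels whose match falls outside the image as inconsistent; these are illustrative assumptions rather than details confirmed by the paper.

```python
import numpy as np

def lr_check_and_fill(disp_l, disp_r, tol=1):
    """Mark inconsistent left disparities and fill them from the same row."""
    h, w = disp_l.shape
    xs = np.arange(w)
    out = disp_l.astype(np.float64).copy()
    for y in range(h):
        match_x = xs - disp_l[y].astype(int)       # matching column in the right map
        inside = match_x >= 0
        ok = np.zeros(w, dtype=bool)
        ok[inside] = np.abs(disp_l[y, inside] - disp_r[y, match_x[inside]]) <= tol
        good, bad = xs[ok], xs[~ok]
        if good.size == 0 or bad.size == 0:
            continue                               # nothing to fill on this row
        # Closest consistent column for every inconsistent column.
        nearest = good[np.abs(bad[:, None] - good[None, :]).argmin(axis=1)]
        out[y, bad] = disp_l[y, nearest]
    return out

left_map = np.array([[1, 1, 5, 1, 1]])
right_map = np.array([[1, 1, 1, 1, 1]])
filled = lr_check_and_fill(left_map, right_map)
```

The spike of 5 and the border pixel without a valid match are both replaced by the neighbouring consistent disparity.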

B. Architecture based on proposed methodology

Fig. 4 shows the disparity estimation architecture based on the proposed methodology. The architecture comprises DWTs, RGB to gray scale image converters, gradient computation modules, Cost Computation Units (CCUs), Guided Image Filtering Units (GIFUs), WTAs, up-sampling modules, an LR-check module, a median filter and a controller. All the modules are fully pipelined to obtain high throughput. The RGB data of the rectified reference and target images enter the processing pipeline consisting of the modules mentioned above. After multiple setup cycles from the pipeline latency, computed depth maps synchronized with the input pixel rate are forwarded successively through the scan-line to the output port.

For the purpose of demonstrating the proof of concept, six DWT modules (one for each of the R, G and B color channels of the left and right images) based on the Haar wavelet (Equations 1 and 2) have been designed, each consisting of one adder and one right shifter. The outputs of the DWT modules are stored in a BRAM. At this stage, the number of samples is reduced by half. The gradient module comprises six subtracters for calculating the left and right RGB gradients in the X-direction. The gradient values of the left and right images are then stored in the buffer, from where they are accessed in the next stage for CVC.

The inputs to the cost volume construction unit (CVCU) are sent through a buffer, Fig. 5a, where two shift registers (length $D_{max}$) send data to the corresponding CVC module through a 2x1 mux depending on the select signal. The shift registers take the data from the input RGB BRAM alternately depending on the enable signal. All the remaining buffers in the design are realized using the FPGA’s internal BRAMs.

The CVCU in Fig. 4 employs a cascade of CCUs to calculate the pixel-wise cost between the pixel $p$ in the left image and the corresponding $(p - d)$ pixel in the right image. The architecture of CCU, based on Equations (3) - (5), is shown in Fig. 5b. It consists of absolute difference units, adders and comparators that calculate the truncated color and gradient costs, which are summed to compute the overall cost. Prior to the summation, the truncated color and gradient costs are multiplied by constant values to normalize the overall cost. To avoid multiplication and without any loss of generality, these constants are selected to be powers of 2, which can be realized with shifting operations alone.

The output of the CCU is fed into the cost volume filtering unit (CVFU) in a row-wise manner. The CVFU employs a cascade of GIFUs. The architecture of the GIFU is based on the pseudocode of Algorithm 1 and is shown in Fig. 5c. Each GIFU comprises boxfilters, subtracters, an adder and a shifter. Among all the components in the GIFU, the boxfilter is the most computationally intensive. However, since the number of input samples to be processed has been reduced by the DWT module, the boxfilters benefit from the reduction in the number of computations per image.

The architecture of the boxfilter is shown in Fig. 6a. The boxfiltering operation for a $6 \times 7$ image template with window radius 1 is illustrated in Fig. 6b, and is explained as follows: If pixel-i of an image is updated then the mean of the box with pixel-i as end pixel is given by the boxfilter. The basic operation of the boxfilter is to maintain each column sum by adding the $(2r+1)$ pixels. The box sum is computed by adding the $(2r+1)$ adjacent column sums. From Fig. 6b, if pixel-i is updated, then the corresponding column sum is updated by adding and subtracting the new and old pixels respectively. In the same way, the box sum is computed by adding and subtracting the new and old column sums respectively. Storing only $(2r + 1)$ rows of pixels for boxfilter operations would be sufficient, resulting in a reduction of memory consumption from the entire image to just $(2r + 1)$ rows of pixels. The boxfilter architecture (Fig. 6a) has two submodules: the column sum computation unit (where the column sum to be updated is computed) and the box sum computation unit (where the box sum to be updated is computed).
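The incremental column-sum and box-sum updates can be illustrated with the following software sketch. It only covers windows that lie fully inside the image and returns sums rather than means (dividing by $(2r+1)^2$ gives the mean); border handling and the function name are assumptions made for illustration.

```python
import numpy as np

def box_sum_streaming(img, r):
    """Sum over each fully contained (2r+1) x (2r+1) window, updated incrementally.

    Returns an array of shape (H - 2r, W - 2r).
    """
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    k = 2 * r + 1
    out = np.empty((h - k + 1, w - k + 1))
    col_sum = img[:k].sum(axis=0)                  # column sums over the first k rows
    for y in range(h - k + 1):
        if y > 0:
            col_sum += img[y + k - 1] - img[y - 1]       # add new row, drop old row
        box = col_sum[:k].sum()                    # first box of this row
        out[y, 0] = box
        for x in range(1, w - k + 1):
            box += col_sum[x + k - 1] - col_sum[x - 1]   # add new column, drop old one
            out[y, x] = box
    return out

sums = box_sum_streaming(np.arange(42).reshape(6, 7), r=1)
```

Each output costs a constant number of additions regardless of $r$, and only $(2r+1)$ rows need to be held at any time, matching the memory argument above.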

The operation of the GIFU for a $6 \times 7$ image template and window radius 1 is shown in Fig. 7 and is explained according to the pseudocode of the GIF (Algorithm 1) as follows: If pixel-i of an image is known then the mean of the box with pixel-i as end pixel is computed by a boxfilter (as described above). Similarly, from Fig. 7, as pixel-i of I (guidance image) and P (guided image) are updated, II and IP of pixel-i are computed by two multipliers. $mean_I$, $mean_P$, $mean_{II}$, $mean_{IP}$, $var_I$ and $cov_{IP}$ of pixel-j are computed by four boxfilters, four multipliers and two subtracters. Consequently, the a and b values of pixel-j are computed by a shifter, multiplier and subtracter. In a similar way, since the a and b values of pixel-j are known, the means of the box with pixel-j as the end pixel, i.e., $mean_a$ and $mean_b$ of pixel-k, are computed by two boxfilters. Finally, q (i.e., the filtered output) of pixel-k is computed by a multiplier and an adder. Thus the entire architecture of the CVFU is pipelined in parallel.

Fig. 7. The operation of the Guided Image Filtering Unit (GIFU).

Two WTA modules, shown in Fig. 4, are used for the left and right disparity selection respectively. Each WTA module comprises comparators organized in a tree structure with $\log_2 D_{max}$ stages. The disparities obtained are stored in buffers before being transferred to the up-sampling units, each of which consists of an adder and a shifter. The up-sampled disparities are stored in the buffers DL and DR, as shown in Fig. 4, from where they are accessed by the post-processing unit. The post-processing module implements the L-R check and filling (interpolation) through a set of comparators, multiplexer trees, and priority encoders that locate the nearest valid disparities to the left and right of the invalid pixels. To reduce noise and eliminate any remaining spikes in the interpolated disparity map, an energy-efficient 3x3 median filter design has been adopted from [31], [32] as shown in Fig. 6c, where $A_0$, $A_1$, $A_2$ are the input disparities. The reason for choosing a 3x3 median filter is explained in Section IV.
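For reference, the effect of a 3x3 median filter on a disparity map can be sketched in a few lines of NumPy. This is a generic software formulation with edge replication at the borders (an assumption); it does not reproduce the energy-efficient hardware design of [31], [32].

```python
import numpy as np

def median3x3(disp):
    """Generic 3x3 median filter with edge-replicated borders."""
    h, w = disp.shape
    padded = np.pad(disp, 1, mode="edge")
    # Stack the nine shifted views of the image, one per window position.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    return np.median(windows, axis=0)

noisy = np.ones((5, 5))
noisy[2, 2] = 9.0            # isolated spike
smoothed = median3x3(noisy)
```

An isolated spike never reaches the middle of the sorted window, so it is removed while flat regions are left unchanged.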

IV. RESULTS AND DISCUSSION

To validate the proposed methodology, the stereo matching core has been implemented on a Xilinx Virtex-7 FPGA, with the system parameters set to the constant values $\{r, \varepsilon, T_c, T_g\} = \{3, 0.7, 2\}$, which are adopted from recent literature [20]–[22] and used throughout our experiments. The metrics used to evaluate the proposed methodology are: error (nonocc, all, disc) (Table II), frames-per-second (FPS), millions of disparity estimations per second (MDE/s), cycles per frame (CPF) and hardware resource utilization (LUTs, slice registers, DSPs, BRAMs).
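As a worked example of the MDE/s metric, the sketch below assumes the commonly used definition (image width × image height × disparity range × frames per second, expressed in millions). The definition is not spelled out in the text, so it should be read as an assumption; the input values are those of the "Proposed" row of Table III.

```python
def mde_per_second(width, height, d_max, fps):
    """Millions of disparity estimations per second (assumed definition)."""
    return width * height * d_max * fps / 1e6

# Values from the "Proposed" row of Table III: 1280*720, D_max = 64, 103 fps.
proposed_mde = mde_per_second(1280, 720, 64, 103)
print(round(proposed_mde, 1))
```

Under this definition the result agrees with the 6075 MDE/s listed for the proposed design.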

The performance of the system designed based on the proposed methodology has been evaluated against the standard Middlebury benchmark that is widely used in evaluating the quality of the stereo-matching algorithms. The four image pairs viz. Tsukuba, Venus, Teddy and Cones are processed by the proposed system and the results are shown in Fig. 8(a-l). A performance comparison of the proposed system with state-of-the-art designs in terms of error (i.e, the percentage of bad matching pixels when compared to the ground truth) is listed in Table II. From this table, it can be observed that the error of the proposed methodology is 5.71% in non-occluded regions, 9.89% across the whole disparity map, and 13.58% in discontinuous regions, resulting in an accuracy of 90.27% (calculated as 100 - average bad pixel percentage). To further validate the proposed methodology rectified real-time stereo images, captured using a stereo camera, are processed. The corresponding results are shown in Fig. 8(o,p). It can be seen that the disparity is sharp for clear edges.
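The accuracy figure quoted above follows directly from the three error values; the short sketch below simply reproduces that arithmetic.

```python
# Error percentages of the proposed methodology, as quoted in the text.
errors = {"nonocc": 5.71, "all": 9.89, "disc": 13.58}

average_error = sum(errors.values()) / len(errors)   # average bad pixel percentage
accuracy = 100.0 - average_error

print(round(average_error, 2), round(accuracy, 2))
```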

The proposed methodology is verified by using different wavelets (db2, sym2, coif1, dmey) and without applying the DWT. The disparity maps obtained for the Tsukuba stereo pair are shown in Fig. 9a and the corresponding overall error graph is shown in Fig. 9b. The error across all the wavelets varies from 6.72% to 7.9%, which shows that any of the wavelets can be used for the proposed methodology, as the errors are within the acceptable range as per Table II. Since hardware complexity should also be considered, the Haar wavelet has been used in this paper as proof of concept, as it has a low-complexity hardware implementation using a simple adder and shifter.

The proposed algorithm is verified for different sizes of median filter (from 3 to 11 with a step-size of 2); the disparity maps are shown in Fig. 9d and the corresponding errors of the obtained disparity maps are shown in Fig. 9c. Disparities obtained from the median filter with a smaller window size have a lower error, since larger window sizes over-smooth the disparity map. The complexity of the median filter also increases with its window size. Hence the median filter with a window size of 3x3 has been chosen for the implementation. The disparity map is also obtained in two different cases, firstly by applying the DWT in both dimensions of the image (2D-DWT) and secondly by taking the second level approximate coefficients, which are shown in Fig. 8(m,n) for the stereo image Tsukuba. In both cases, the computational complexity has been reduced, since the number of input samples is reduced by 75%, but at the cost of a higher error rate of 26% for 2D-DWT and 22.6% for the second level DWT.

Fig. 8. Left images: (a) Tsukuba (b) Cones (c) Venus (d) Teddy; (e)-(h) Ground truth, (i)-(l) Outcome from the proposed methodology, Obtained disparity by applying (m) 2nd level DWT, (n) 2D DWT, (o) Real time left stereo image (p) Obtained disparity using proposed method.

Fig. 10a shows a comparison of the proposed methodology with the state-of-the-art methodology of [20] in terms of computational complexity (i.e., number of pixels processed at every stage of algorithm). The x-axis shows the different stages of algorithm - DWT, CVC, GIF, WTA, IDWT and post processing as described in Section III-B, and the y-axis shows the number of pixels processed at every stage (where 1 is equivalent to size of an input image). The computational complexity of the proposed methodology is significantly reduced due to the reduction in the number of samples at the initial stage of the algorithm due to the application of the DWT. Fig. 10b gives the comparison of the latest designs [9], [22], [23], [35] with the proposed methodology in terms of the number of cycles needed for computing the disparity of a single frame ($CPF = \frac{Frequency}{FPS}$). The performance of proposed methodology does not degrade with the increase in image size.
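The CPF relation can be made concrete with a small sketch. The clock frequency used in the usage line is a hypothetical value chosen only for illustration; it is not the operating frequency reported for any design in the paper.

```python
def cycles_per_frame(clock_hz, fps):
    """CPF = Frequency / FPS."""
    return clock_hz / fps

# Hypothetical 100 MHz clock at 100 frames per second (illustrative values only).
example_cpf = cycles_per_frame(100e6, 100.0)
```

A lower CPF means that fewer clock cycles are spent per frame, independent of the clock frequency a given platform happens to reach.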

Table III gives the comparison of the proposed methodology with related works in terms of quality (error), performance and hardware resource utilization. The resource utilization includes post-processing after FPGA-based optimizations. LUT utilization of proposed system is the least of all the approaches at the expense of 3.1 higher error on average compared to state-of-the-art designs [9], [21], [22], [35]. The error incurred is attributed to the application of the DWT and computing the approximate coefficients from the first resolution level of the DWT. Table IV provides the FPGA resource utilization for the system designed based on the proposed methodology in terms of LUTs, Registers, DSPs and BRAMs where core $\{16, 32, 64\}$ is the maximum disparity range number. Module-
TABLE III
QUALITY, SPEED AND ERROR COMPARISON WITH THE RELATED WORKS. (N.A = NOT APPLICABLE, N.M = NOT MENTIONED)

<table>
<thead>
<tr>
<th>Work</th>
<th>Image size</th>
<th>$D_{max}$</th>
<th>Speed (fps)</th>
<th>MDE/s</th>
<th>Error (%)</th>
<th>Platform</th>
<th>LUTs</th>
<th>Slice Registers</th>
</tr>
</thead>
<tbody>
<tr>
<td>ADSW [38]</td>
<td>320*240</td>
<td>30</td>
<td>0.01</td>
<td>0.0263</td>
<td>6.53</td>
<td>CPU</td>
<td>N.A</td>
<td>N.A</td>
</tr>
<tr>
<td>Chang [14]</td>
<td>352*288</td>
<td>64</td>
<td>42.5</td>
<td>272.5</td>
<td>N.A</td>
<td>ASIC</td>
<td>N.A</td>
<td>N.A</td>
</tr>
<tr>
<td>Jin [5]</td>
<td>640*480</td>
<td>64</td>
<td>230</td>
<td>4522</td>
<td>16.5</td>
<td>FPGA</td>
<td>114214</td>
<td>N.M</td>
</tr>
<tr>
<td>Ambrosch [6]</td>
<td>750*400</td>
<td>60</td>
<td>60</td>
<td>1080</td>
<td>12.5</td>
<td>FPGA</td>
<td>139606</td>
<td>N.M</td>
</tr>
<tr>
<td>Banz [15]</td>
<td>640*480</td>
<td>128</td>
<td>103</td>
<td>4050</td>
<td>6.7</td>
<td>FPGA</td>
<td>68427</td>
<td>N.M</td>
</tr>
<tr>
<td>Ding [11]</td>
<td>640*480</td>
<td>60</td>
<td>51</td>
<td>940</td>
<td>11.9</td>
<td>FPGA</td>
<td>50585</td>
<td>35020</td>
</tr>
<tr>
<td>Hosni [20]</td>
<td>640*480</td>
<td>26</td>
<td>25</td>
<td>200</td>
<td>5.57</td>
<td>GPU</td>
<td>N.A</td>
<td>N.A</td>
</tr>
<tr>
<td>Zhang [25]</td>
<td>1024*768</td>
<td>64</td>
<td>60</td>
<td>3019</td>
<td>8.20</td>
<td>FPGA</td>
<td>53095</td>
<td>74109</td>
</tr>
<tr>
<td>MCADSR [35]</td>
<td>1024*768</td>
<td>128</td>
<td>129</td>
<td>13076</td>
<td>7.65</td>
<td>FPGA</td>
<td>60160</td>
<td>33291</td>
</tr>
<tr>
<td>Perri [10]</td>
<td>640*480</td>
<td>60</td>
<td>45</td>
<td>829</td>
<td>10.09</td>
<td>FPGA</td>
<td>N.M</td>
<td>80270</td>
</tr>
<tr>
<td>Jin [23]</td>
<td>1024*768</td>
<td>60</td>
<td>199.3</td>
<td>9362</td>
<td>6.05</td>
<td>FPGA</td>
<td>122900</td>
<td>N.M</td>
</tr>
<tr>
<td>Ttofis [13]</td>
<td>640*480</td>
<td>64</td>
<td>60</td>
<td>1179</td>
<td>9.79</td>
<td>FPGA</td>
<td>88791</td>
<td>117260</td>
</tr>
<tr>
<td>Christos [22]</td>
<td>1280*720</td>
<td>64</td>
<td>60</td>
<td>3538</td>
<td>6.80</td>
<td>FPGA</td>
<td>57492</td>
<td>71192</td>
</tr>
<tr>
<td>Wenqiang [9]</td>
<td>1024*768</td>
<td>96</td>
<td>67.8</td>
<td>10472</td>
<td>5.61</td>
<td>FPGA</td>
<td>125255</td>
<td>81092</td>
</tr>
<tr>
<td>Proposed</td>
<td>1280*720</td>
<td>64</td>
<td>103</td>
<td>6075</td>
<td>9.73</td>
<td>FPGA</td>
<td>34181</td>
<td>47368</td>
</tr>
</tbody>
</table>
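
The MDE/s column appears consistent with the common definition of million disparity estimations per second, that is, image width × image height × $D_{max}$ × frame rate, divided by $10^6$. The sketch below assumes that definition (it is not stated in this section) and uses a hypothetical helper, `mde_per_second`, to recompute the entries for the proposed design and for [22]. Other rows follow the figures reported in their source papers and need not match this formula.

```python
def mde_per_second(width: int, height: int, d_max: int, fps: float) -> float:
    """Million disparity estimations per second (assumed definition)."""
    return width * height * d_max * fps / 1e6


# Values taken from the "Proposed" and "Christos [22]" rows of Table III.
proposed = mde_per_second(1280, 720, 64, 103)
christos = mde_per_second(1280, 720, 64, 60)

print(f"Proposed: {proposed:.1f} MDE/s")
print(f"[22]:     {christos:.1f} MDE/s")
```

Under this assumption both results agree with the tabulated 6075 and 3538 once the fractional part is dropped.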

Fig. 9. (a) Disparity obtained for different wavelets, where 'db2' belongs to the Daubechies family, 'sym2' to the Symlets family and 'coif1' to the Coiflets family, and 'dmey' is the discrete approximation of the Meyer wavelet; (b) disparity error obtained for the different wavelets; (c) error obtained for the Tsukuba image for different median filter sizes (the x-axis gives the median filter window size and the y-axis the error with respect to the ground-truth disparity); (d) error analysis when varying the median filter window size $r_m$ over the range {3, 5, 7, 9, 11, 13}.
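
To make concrete what retaining only the first-level approximation coefficients means, the following minimal sketch applies a one-level, row-wise Haar-style decomposition and discards the detail coefficients. The choice of the Haar wavelet, the averaging normalization (division by 2 rather than by $\sqrt{2}$) and the function name `haar_approximation` are assumptions made purely for illustration; they are not taken from the architecture described in the paper.

```python
def haar_approximation(row):
    """One-level Haar-style approximation: average each non-overlapping pair.

    Assumption: averaging normalization; detail coefficients are discarded.
    """
    samples = list(row)
    if len(samples) % 2:
        samples.append(samples[-1])  # pad odd-length rows by repeating the last sample
    return [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples), 2)]


row = [10, 12, 20, 22, 30, 30, 7, 9]   # one image row (made-up intensities)
approx = haar_approximation(row)

print(len(row), "->", len(approx))      # the sample count is halved
print(approx)
```

Halving each row in this way is what reduces the number of samples passed to the later stages by 50%, and the discarded detail is one plausible source of the additional error discussed with Table III.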

Table VII compares the designed system in terms of frequency, power and FPS. A 52% improvement in FPS is observed when the proposed methodology is compared with one of the recently reported designs [9] (103 fps versus 67.8 fps in Table III). Based on the Table III figures, LUT utilization is reduced by 72.7%, 61.5%, 40.5%, 35.6% and 43.2% relative to the results reported in [9], [13], [22], [25] and [35], respectively. Slice register utilization is reduced by 41.6%, 59.6%, 33.46% and 36% relative to [9], [13], [22] and [25], respectively. A detailed comparison is given in Table VIII and Fig. 10c.
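
The relative figures quoted here can be recomputed from the Table III entries. The sketch below does so with a hypothetical helper, `reduction_percent`, assuming that "reduction" means $(1 - \text{proposed}/\text{reference}) \times 100$ and that the FPS improvement is measured relative to the reference frame rate.

```python
def reduction_percent(proposed: float, reference: float) -> float:
    """Percentage by which `proposed` is lower than `reference`."""
    return (1 - proposed / reference) * 100


# LUT and slice-register counts copied from Table III, keyed by citation.
lut_refs = {"[9]": 125255, "[13]": 88791, "[22]": 57492, "[25]": 53095, "[35]": 60160}
reg_refs = {"[9]": 81092, "[13]": 117260, "[22]": 71192, "[25]": 74109}
proposed_lut, proposed_reg = 34181, 47368

lut_reduction = {ref: reduction_percent(proposed_lut, n) for ref, n in lut_refs.items()}
reg_reduction = {ref: reduction_percent(proposed_reg, n) for ref, n in reg_refs.items()}
fps_gain = (103 / 67.8 - 1) * 100  # proposed vs. [9]

for ref in lut_refs:
    print(f"LUT reduction vs {ref}: {lut_reduction[ref]:.1f}%")
for ref in reg_refs:
    print(f"Register reduction vs {ref}: {reg_reduction[ref]:.1f}%")
print(f"FPS improvement vs [9]: {fps_gain:.0f}%")
```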
TABLE IV
FPGA RESOURCE UTILIZATION BY THE SYSTEM DESIGNED BASED ON THE PROPOSED METHODOLOGY.

<table>
<thead>
<tr>
<th>Design</th>
<th>LUT</th>
<th>Slice register</th>
<th>DSP 48E</th>
<th>BRAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>Stereo Matching core-16</td>
<td>8709</td>
<td>12561</td>
<td>68</td>
<td>36</td>
</tr>
<tr>
<td>Stereo Matching core-32</td>
<td>18256</td>
<td>24157</td>
<td>129</td>
<td>120</td>
</tr>
<tr>
<td>Stereo Matching core-64</td>
<td>34181</td>
<td>47368</td>
<td>273</td>
<td>247</td>
</tr>
</tbody>
</table>

TABLE V
MODULE-WISE RESOURCE UTILIZATION

<table>
<thead>
<tr>
<th>Module Name</th>
<th>LUT</th>
<th>Slice register</th>
<th>DSP 48E</th>
<th>BRAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>DWT</td>
<td>25</td>
<td>42</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>CVC</td>
<td>9328</td>
<td>5320</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>GIF</td>
<td>21053</td>
<td>39953</td>
<td>273</td>
<td>228</td>
</tr>
<tr>
<td>WTA</td>
<td>494</td>
<td>275</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>LR-Check Filling</td>
<td>515</td>
<td>720</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Median filter</td>
<td>366</td>
<td>328</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
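
As a reading aid for Table V, the sketch below computes each module's share of the core-64 LUT count from Table IV. It assumes that Table V describes the core-64 configuration, which the matching DSP count of 273 suggests but this section does not state. The listed modules do not sum exactly to the core-64 total, and the source does not say where the remainder lies.

```python
# LUT counts copied from Table V; the core-64 total is copied from Table IV.
module_lut = {
    "DWT": 25,
    "CVC": 9328,
    "GIF": 21053,
    "WTA": 494,
    "LR-Check Filling": 515,
    "Median filter": 366,
}
core64_lut = 34181  # assumption: Table V refers to the core-64 design

shares = {name: 100 * lut / core64_lut for name, lut in module_lut.items()}
module_total = sum(module_lut.values())

for name, share in shares.items():
    print(f"{name:<18}{share:6.2f}% of core-64 LUTs")
print(f"Listed modules: {module_total} of {core64_lut} LUTs")
```

On these numbers the guided image filter (GIF) dominates the LUT budget, while the added DWT stage contributes well under 1%.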

V. CONCLUSION

In this paper we introduced a DWT-based methodology into a state-of-the-art disparity estimation algorithm, resulting in a significant improvement in speed and computational complexity. A system designed with the proposed methodology has been evaluated against the standard Middlebury benchmarks, which are widely used to assess the quality of stereo matching algorithms.
The four image pairs Tsukuba, Venus, Teddy and Cones were used to test the proposed system. It has been demonstrated that, on an FPGA, the system achieves improvements of 44.4% in cycles per frame, 52% in frames per second, and 61.5% and 59.6% in LUT and register utilization, respectively, compared with state-of-the-art designs. We believe that our system can have a significant impact on applications in autonomous vehicles and mobile robotics by meeting real-time processing requirements in resource-constrained scenarios.

REFERENCES


TABLE VIII
HARDWARE RESOURCE UTILIZATION COMPARISON WITH STATE-OF-THE-ART DESIGNS (N.M= NOT MENTIONED)

<table>
<thead>
<tr>
<th>Work</th>
<th>Image Size</th>
<th>LUT</th>
<th>Slice Register</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jin 2014 [23]</td>
<td>1024x768</td>
<td>122900</td>
<td>N.M</td>
</tr>
<tr>
<td>Shan 2014 [35]</td>
<td>1024x768</td>
<td>53291</td>
<td></td>
</tr>
<tr>
<td>Wang 2015 [9]</td>
<td>1024x768</td>
<td>125255</td>
<td>81092</td>
</tr>
<tr>
<td>Christos 2016 [22]</td>
<td>1280x720</td>
<td>57492</td>
<td>71192</td>
</tr>
<tr>
<td>Proposed</td>
<td>1280x720</td>
<td>34181</td>
<td>47368</td>
</tr>
</tbody>
</table>
tree-structured dynamic programming on FPGA,” in Proceedings of the
ACM/SIGDA International Symposium on Field Programmable Gate
[37] Y. Shan, Z. Wang, W. Wang, Y. Hao, Y. Wang, K. Tsoi, W. Luk, and
H. Yang, “FPGA based memory efficient high resolution stereo vision
system for video tolling,” in Field-Programmable Technology (FPT),
[38] K.-J. Yoon and I. S. Kweon, “Adaptive support-weight approach for
correspondence search,” IEEE Transactions on Pattern Analysis and

Charan Kumar Vala received his master's degree in microelectronics and very large scale integration (VLSI) from the Indian Institute of Technology Hyderabad, India. He is currently a research scholar at the School of Electronics and Computer Science, University of Southampton, U.K. His research interests include VLSI architectures, FPGA implementation, VLSI for cyber-physical systems, low-power design techniques, and analog/mixed-signal ASICs. He was a runner-up in the All India Cadence Circuit Design Contest.

Koushik Immadisetty received the B.Tech. degree in Electrical Engineering from the Indian Institute of Technology Hyderabad, India, in 2017. His research interests include efficient algorithms and architectures for image/video processing, stereo vision, VLSI architectures, and low-power design techniques. He is currently working at Qualcomm Innovation Center Inc., India, as an associate engineer.

Amit Acharyya (M'11) received the Ph.D. degree from the School of Electronics and Computer Science, University of Southampton, U.K., in 2011. He is currently an Assistant Professor with IIT Hyderabad, Hyderabad, India. His research interests include signal processing algorithms, VLSI architectures, low-power design techniques, computer arithmetic, numerical analysis, linear algebra, bioinformatics, and electronic aspects of pervasive computing.

Charles Leech is a senior research assistant for the PRIME project at the University of Southampton, where he received his first-class honours BEng degree in Electronic Engineering. His work focuses on power and performance optimisation of computer vision applications on heterogeneous embedded systems. He is also involved in the development of software frameworks for runtime management of many-core systems. His interests include approximate computing, stereo vision, and machine learning for embedded devices.

Vibishna Balagopal is a Senior Electronics Design Engineer at LumiraDx Technology, U.K. She was awarded a first-class honors degree in Electronics and Communication Engineering from the University of Calicut, India, in 2008 and completed her Post Graduate Diploma in VLSI design in 2009 at the National Institute of Electronics and IT, India. She currently works on digital system design for embedded FPGA platforms in point-of-care medical devices. Her research interests include run-time energy management in many/multi-core embedded systems. Previously, she worked on the design and development of a beam steering controller for an AESA radar application. She has worked as a Senior Research Assistant at the University of Southampton, U.K., as a Scientist at the Electronics and Radar Development Establishment - DRDO, India, and as a Project Engineer at Wipro Technologies, India.

Geoff V. Merrett (GSM'06-M'09) received the B.Eng. degree (Hons.) in electronic engineering and the Ph.D. degree from the University of Southampton, Southampton, U.K., in 2004 and 2009, respectively. He is currently an Associate Professor in electronic systems with the University of Southampton. His current research interests include low-power and energy harvesting aspects of embedded and mobile systems. He has published over 100 articles in journals and conferences in these areas. Dr. Merrett was the General Chair of the Energy Neutral Sensing Systems Workshop from 2013 to 2015. He is a fellow of the Higher Education Academy.

Bashir M. Al-Hashimi (M'99-SM'01-F'09) is an ARM Professor of Computer Engineering, Dean of the Faculty of Physical Sciences and Engineering, and the Co-Director of the ARM-ECS Research Centre, University of Southampton, Southampton, U.K. He has published over 380 technical papers. His current research interests include methods, algorithms, and design automation tools for low-power design and test of embedded computing systems. He has authored or co-authored five books and has graduated 35 Ph.D. students.