Figure 4: Comparison of Roussopoulos et al. and simplified knn search with Hilbert R-tree and random query data points
Figure 5: Comparison of Roussopoulos et al. and simplified knn search with Hilbert R-tree and clustered query data points
Figures 4 and 5 show the performance difference between Roussopoulos et al.'s knn search and our simplified knn search for 20,000 randomly distributed query points and clustered query points, respectively.
In figure 4, the simplified knn search improves performance only slightly for the 20,000 randomly generated query points (no more than 10% fewer nodes accessed). If a query data point lies far outside the clustered data, there are fewer overlaps between MBRs around the cluster boundary. In general, the clustered query points in figure 5 access more nodes than the randomly generated points, except in 16 dimensions. However, the shape of the graph depends mainly on how closely the data points are positioned in the database in each dimension, and such cases are very hard to visualise in high dimensions. Figure 5 shows a clear performance difference: the simplified knn search accesses considerably fewer nodes to retrieve the nearest neighbours than Roussopoulos et al.'s search, up to 40% fewer. For the clustered database, our knn search gives reasonable performance even in high dimensions. With our technique, the number of data object comparisons needed to retrieve 10 nearest neighbours in 2D is about 0.3% of the entire database population, and about 8.0% in 14D for 50 nearest neighbours; the other method required 1.2% and 12% for the 2D and 14D cases respectively.
However, we have also compared both knn searches on a synthetic random database and, as expected, our proposed nearest neighbour search accesses no more than 3% fewer nodes than the other method. There is significantly less overlap between MBRs in the random database than in the clustered database, and the simplified knn search only performs well when several nodes have a MINDIST value of zero. Consequently, both knn search algorithms access almost the same number of nodes on random databases.
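To make the zero-MINDIST condition concrete, the following minimal Python sketch computes MINDIST as the squared Euclidean distance from a query point to the nearest face of an MBR, which is zero whenever the point lies inside the MBR. It illustrates the metric only and is not our search implementation; the function names `mindist` and `count_zero_mindist` and the representation of an MBR as a pair of corner tuples are assumptions made for this example.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical MBR representation: (low corner, high corner), one value per dimension.
MBR = Tuple[Sequence[float], Sequence[float]]


def mindist(point: Sequence[float], low: Sequence[float], high: Sequence[float]) -> float:
    """Squared distance from `point` to the MBR [low, high]; 0.0 if the point is inside."""
    total = 0.0
    for p, lo, hi in zip(point, low, high):
        if p < lo:
            d = lo - p      # point lies below the MBR in this dimension
        elif p > hi:
            d = p - hi      # point lies above the MBR in this dimension
        else:
            d = 0.0         # point falls within the MBR's extent in this dimension
        total += d * d
    return total


def count_zero_mindist(point: Sequence[float], mbrs: Iterable[MBR]) -> int:
    """Number of MBRs that contain the query point, i.e. have MINDIST equal to zero."""
    return sum(1 for low, high in mbrs if mindist(point, low, high) == 0.0)


# Two overlapping 2D MBRs, as might occur in a clustered database.
mbrs = [((0.0, 0.0), (2.0, 2.0)), ((1.0, 1.0), (3.0, 3.0))]
inside_overlap = count_zero_mindist((1.5, 1.5), mbrs)   # query inside both MBRs
far_outside = count_zero_mindist((5.0, 5.0), mbrs)      # query outside both MBRs
```

A query point inside the overlap region has zero MINDIST to both MBRs, so MINDIST alone cannot order them, whereas a point far outside the data has a distinct positive MINDIST to each one.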
Regardless of the nature of the database, our method can still perform similarity searches with less calculation.