__Nearest Neighbor Searching__

Nearest neighbor searching is an important problem in a variety of applications, including knowledge discovery and data mining, pattern recognition and classification, machine learning, data compression, multimedia databases, document retrieval, and statistics. We are given a set *S* of objects in some space to be preprocessed, so that given a query object *q*, the closest object (or objects) of *S* can be reported quickly.

There are many ways in which to define the notion of similarity. Because the focus of this chapter is on geometric approaches, we shall assume that proximity is defined in terms of the well known Euclidean distance. Most of the results to be presented below can be generalized to any *Minkowski* (or *L_m*) *metric*, in which the distance between two points *p* and *q* is defined to be

$$ \mathrm{dist}_m(p, q) = \Bigl( \sum_{i=1}^{d} |p_i - q_i|^m \Bigr)^{1/m} $$

where *m* ≥ 1 is a constant. The case *m* = 2 is the Euclidean distance, the case *m* = 1 is the Manhattan distance, and the limiting case *m* = ∞ is the max distance. In typical geometric applications the dimension *d* is assumed to be a fixed constant. There has also been work on high dimensional proximity searching in spaces of arbitrarily high dimensions [49] and in arbitrary (nongeometric) metric spaces [23], which we shall not cover here.
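As a concrete illustration, the *L_m* distances above can be computed as follows (a minimal sketch in Python; the function name is our own):

```python
import math

def minkowski_dist(p, q, m=2.0):
    """L_m (Minkowski) distance between points p and q.
    m=2 gives the Euclidean distance, m=1 the Manhattan distance,
    and m=inf the max (Chebyshev) distance."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(m):
        return max(diffs)
    return sum(x ** m for x in diffs) ** (1.0 / m)
```

For example, `minkowski_dist((0, 0), (3, 4))` returns 5.0 (Euclidean), while `m=1` gives 7.0 and `m=float('inf')` gives 4.0.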

There are a number of natural extensions to the nearest neighbor problem as described above. One is to report the *k* nearest neighbors to the query point, for some given integer *k*. Another is to compute all the points lying within some given distance, that is, a range query in which the range is defined by the distance function.

Obviously, without any preprocessing whatsoever, the nearest neighbor search problem can be solved in *O*(*n*) time through simple brute-force search. A number of very simple methods have been proposed which assume minimal preprocessing. For example, points can be sorted according to their projection along a line, and the projected distances can be used as a method to prune points from consideration [40, 44, 54]. These methods are only marginally effective, and provide significant improvements over brute-force search only in very low dimensions.

For uniformly distributed point sets, good expected case performance can be achieved by simple decompositions of space into a regular grid of hypercubes. Rivest [65] and later Cleary [28] provided analyses of these methods. Bentley, Weide, and Yao [17] also analyzed a grid-based method for distributions satisfying certain bounded-density assumptions. Intuitively, the points are bucketed into grid cells, where the size of the grid cell is based on the expected distance to the nearest neighbor. To answer a query, the grid cell containing the query point is located, and a spiral-like search working outwards from this cell is performed to identify nearby points. Suppose for example that *q* is the query point and *p* is its closest neighbor. Then all the grid cells overlapping a ball centered at *q* of radius dist(*p, q*) would be visited.

Grids are easy to implement, since each bucket can be stored as a simple list of points, and the complete set of buckets can be arranged in a multi-dimensional array. Note that this may not be space efficient, since it requires storage for empty cells. A more space-efficient method is to assign a hash code to each grid cell based on its location, and then store only the nonempty grid buckets in a hash table. In general, grid methods do not work well for nearest neighbor search unless the point distribution is roughly uniform. As will be discussed below, more sophisticated methods are needed to achieve good efficiency for nonuniformly distributed data.
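A hashed-grid scheme along these lines might be sketched as follows (the helper names are our own, and the sketch assumes a nonempty data set; the spiral search visits cells ring by ring and stops once no unvisited cell can contain a closer point):

```python
import itertools
import math
from collections import defaultdict

def build_grid(points, cell):
    """Bucket points into a sparse grid: only nonempty cells are
    stored, keyed by their integer cell index in a hash table."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(c / cell) for c in p)].append(p)
    return grid

def grid_nn(grid, cell, q):
    """Spiral (ring-by-ring) search outward from q's bucket."""
    qc = tuple(math.floor(c / cell) for c in q)
    d = len(q)
    best, best_dist = None, float('inf')
    r = 0
    while True:
        # visit all cells at Chebyshev ring distance exactly r from qc
        for offs in itertools.product(range(-r, r + 1), repeat=d):
            if max(abs(o) for o in offs) != r:
                continue
            for p in grid.get(tuple(c + o for c, o in zip(qc, offs)), ()):
                dist = math.dist(p, q)
                if dist < best_dist:
                    best, best_dist = p, dist
        # any point in ring r+1 lies at distance >= r * cell from q,
        # so once that bound exceeds best_dist we can stop
        if best is not None and r * cell > best_dist:
            return best
        r += 1
```

This visits exactly the cells overlapping the ball around *q* of radius dist(*p, q*) (plus the rings needed to establish that bound), matching the behavior described above.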

**Nearest Neighbor Searching through Point Location**

One of the original motivations for the Voronoi diagram is nearest neighbor searching. By definition, the Voronoi diagram subdivides space into cells according to which site is the closest. So, in order to determine the closest site, it suffices to compute the Voronoi diagram and generate a point location data structure for the Voronoi diagram. In this way, nearest neighbor queries are reduced to point location queries. This provides an optimal *O*(*n*) space and *O*(log *n*) query time method for answering nearest neighbor queries in the plane. Unfortunately, this solution does not generalize well to higher dimensions. The worst-case combinatorial complexity of the Voronoi diagram in dimension *d* grows as Θ(*n*^⌈*d*/2⌉), and optimal point location data structures are not known to exist in higher dimensions.

Perhaps the most popular class of approaches to nearest neighbor searching involves some sort of hierarchical spatial subdivision. Let *S* denote the set of *n* points in R^*d* for which queries are to be answered. In such an approach, the entire space is subdivided into successively smaller regions, and the resulting hierarchy is represented by a rooted tree. Each node of the tree represents a region of space, called a *cell*. Implicitly, each node represents the subset of points of *S* that lie within its cell. The root of the tree is associated with the entire space and the entire point set *S*. For some arbitrary node *u* of the tree, if the number of points of *S* associated with *u* is less than some constant, then this node is declared to be a leaf of the tree. Otherwise, the cell associated with *u* is recursively subdivided into smaller (possibly overlapping) subcells according to some *splitting rule*. Then the associated points of *S* are distributed among these children according to which subcell they lie in. These subcells are then associated with the children of *u* in the tree.

There are many ways in which to define such a subdivision. Perhaps the earliest and best known example is that of the k-d tree data structure. Bentley [16] introduced the *k-d tree* data structure (or *kd-tree*) as a practical general-purpose data structure for many types of geometric retrieval problems. Although it is not the asymptotically most efficient solution for these problems, its flexibility makes it a popular choice for implementation. The cells of a k-d tree are axis-aligned hyperrectangles. Each internal node is associated with an axis-orthogonal splitting hyperplane. This hyperplane splits the rectangular cell into two rectangular subcells, each of which is associated with one of the two children. An example is shown in Figure 63.12.

The choice of the splitting hyperplane is an important issue in the implementation of the k-d tree. For the purpose of nearest neighbor searching, a good split is one that divides the points into subsets of similar cardinalities and which produces cells that are not too skinny, that is, the ratio between the longest and shortest sides is bounded. However, it is not always possible to achieve these goals. A simple and commonly used method is to cycle through the various coordinate axes (that is, splitting along *x*, then *y*, then *z*, then back to *x*, and so on). Each time the split is made through the median coordinate along the splitting dimension [31, 66]. Friedman, Bentley and Finkel [41] suggested the following method, which is more sensitive to the data distribution. First, compute the minimum axis-aligned bounding box for the set of points associated with the current cell. Next choose the splitting axis to be the one that is parallel to the longest side of this box. Finally, split the points by a hyperplane that is orthogonal to this axis, and which splits the points into two sets of equal size. A number of other splitting rules have been proposed for k-d trees, including the sliding midpoint rule by Arya and Fu [3] and Maneewongvatana and Mount [57], variance minimization by White and Jain [76], and methods by Silva Filho [37] and Sproull [73]. We will discuss other subdivision methods in the next section as well.

It is possible to construct the k-d tree of an *n*-element point set in *O*(*n* log *n*) time by a simple top-down recursive procedure. The process involves determining the splitting axis and the splitting coordinate along this axis, and then partitioning the point set about this coordinate. If the splitting rule partitions the point set about its median coordinate then it suffices to compute the median by any linear-time algorithm for computing medians [30]. Some splitting methods may not evenly partition the point set. In the worst case this can lead to quadratic construction time. Vaidya showed that it is possible to achieve *O*(*n* log *n*) construction time, even when unbalanced splitting takes place [75]. The total space requirements are *O*(*n*) for the tree itself.

Given a query point *q*, a nearest neighbor search is performed by the following recursive algorithm [41]. Throughout, the algorithm maintains the closest point to *q* encountered so far in the search, and the distance to this closest point. As the nodes of the tree are traversed, the algorithm maintains the *d*-dimensional hyperrectangular cell associated with each node. (This is updated incrementally as the tree is traversed.) When the search arrives at a leaf node, it computes the distance from *q* to the associated point(s) of this node, and updates the closest point if necessary. (See Figure 63.13.) Otherwise, when it arrives at an internal node, it first computes the distance from the query point to the associated cell. If this distance is greater than the distance to the closest point so far, the search returns immediately, since the subtree rooted at this node cannot provide a closer point. Otherwise, it is determined which side of the splitting hyperplane contains the query point. First, the closer child is visited and then the farther child. A somewhat more intelligent variant of this method, called *priority search*, involves storing the unvisited nodes in a priority queue, sorted according to the distance from the query point to the associated cell, and then processes the nodes in increasing order of distance from the query point [9].
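To make the construction and the two search strategies concrete, here is a compact sketch in Python. The node representation is our own, and for brevity the pruning bound for the far subtree uses only the distance to the splitting hyperplane along one axis, a common simplification of the full cell-distance test described above (it is a valid lower bound, so exactness is preserved, but it prunes less aggressively):

```python
import heapq
import itertools
import math

def build_kdtree(points, leaf_size=1):
    """Longest-side median split (the Friedman-Bentley-Finkel rule).
    Sorting gives a simple O(n log^2 n) build; linear-time median
    selection would yield the O(n log n) bound stated in the text."""
    if len(points) <= leaf_size:
        return ('leaf', list(points))
    d = len(points[0])
    lo = [min(p[i] for p in points) for i in range(d)]
    hi = [max(p[i] for p in points) for i in range(d)]
    axis = max(range(d), key=lambda i: hi[i] - lo[i])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return ('node', axis, pts[mid][axis],
            build_kdtree(pts[:mid], leaf_size),
            build_kdtree(pts[mid:], leaf_size))

def nn_search(node, q, best=None):
    """Recursive search: visit the closer child first, then prune
    the farther child if it cannot contain a closer point."""
    if node[0] == 'leaf':
        for p in node[1]:
            if best is None or math.dist(p, q) < math.dist(best, q):
                best = p
        return best
    _, axis, split, left, right = node
    near, far = (left, right) if q[axis] <= split else (right, left)
    best = nn_search(near, q, best)
    # the distance to the splitting plane lower-bounds the distance
    # to every point in the far subtree
    if best is None or abs(q[axis] - split) < math.dist(best, q):
        best = nn_search(far, q, best)
    return best

def priority_nn_search(root, q):
    """Priority search: process subtrees in increasing order of a
    lower bound on their distance from q, stopping when the next
    bound already exceeds the best distance found."""
    counter = itertools.count()  # tie-breaker so nodes never compare
    heap = [(0.0, next(counter), root)]
    best = None
    while heap:
        bound, _, node = heapq.heappop(heap)
        if best is not None and bound >= math.dist(best, q):
            break  # no remaining subtree can contain a closer point
        if node[0] == 'leaf':
            for p in node[1]:
                if best is None or math.dist(p, q) < math.dist(best, q):
                    best = p
        else:
            _, axis, split, left, right = node
            near, far = (left, right) if q[axis] <= split else (right, left)
            heapq.heappush(heap, (bound, next(counter), near))
            heapq.heappush(heap, (max(bound, abs(q[axis] - split)),
                                  next(counter), far))
    return best
```

Both searches return the exact nearest neighbor; priority search tends to examine fewer nodes because promising subtrees are expanded first.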

**Other Approaches to Nearest Neighbor Searching**

The k-d tree is but one example of a general class of nearest neighbor search structures that are based on hierarchical space decomposition. A good survey of methods from the database literature was given by Böhm, Berchtold, and Keim [20]. These include the R-tree [46] and its variants, the R*-tree [15], the R+-tree [69], and the X-tree [18], which are all based on recursively decomposing space into (possibly overlapping) hyperrectangles. (See Chapter 21 for further information.) For the cases studied, the X-tree is reported to have the best performance for nearest neighbor searching in high dimensional spaces [20].

The SS-tree [76] is based on subdividing space using (possibly overlapping) hyperspheres rather than rectangles. The SR-tree [51] uses the intersection of an enclosing rectangle and enclosing sphere to represent a cell. The TV-tree [56] applies a novel approach of considering projections of the data set onto higher dimensional subspaces at successively deeper levels in the search tree.

A number of algorithms for nearest neighbor searching have been proposed in the algorithms and computational geometry literature. Higher dimensional solutions with sublinear worst-case performance were considered by Yao and Yao [77]. Clarkson [25] showed that queries could be answered in *O*(log *n*) time with *O*(*n*^(⌈*d*/2⌉+*δ*)) space, for any *δ* > 0. The *O*-notation hides constant factors that are exponential in *d*. Agarwal and Matoušek [2] generalized this by providing a tradeoff between space and query time. Meiser [59] showed that queries could be answered in *O*(*d*^5 log *n*) time and *O*(*n*^(*d*+*δ*)) space, for any *δ* > 0, thus showing that exponential factors in query time could be eliminated by using sufficient space.

**Approximate Nearest Neighbor Searching**

In any fixed dimension greater than two, no method for exact nearest neighbor searching is known that achieves the simultaneous goals of roughly linear space and logarithmic query time. For methods achieving roughly linear space, the constant factors hidden in the asymptotic running time grow at least as fast as 2^*d* (depending on the metric). Arya *et al.* [11] showed that if *n* is not significantly larger than 2^*d*, then boundary effects decrease this exponential dimensional dependence. Nonetheless, the so called "curse of dimensionality" is a significant impediment to computing nearest neighbors efficiently in high dimensional spaces.

This suggests the idea of computing nearest neighbors approximately. Consider a set of points *S* and a query point *q*. For any *ε* > 0, we say that a point *p* ∈ *S* is an *ε-approximate nearest neighbor* of *q* if

$$ \mathrm{dist}(p, q) \le (1 + \varepsilon) \, \mathrm{dist}(p^{*}, q), $$

where *p*^* is the true nearest neighbor of *q* in *S*. The approximate nearest neighbor problem was first considered by Bern [19]. He proposed a data structure that achieved a fixed approximation factor depending on dimension. Arya and Mount [10] proposed a randomized data structure that achieves polylogarithmic query time in the expected case, and nearly linear space. Their approach was based on a combination of the notion of neighborhood graphs, as described in Section 63.3.3, and skip lists. In their algorithm the approximation error factor *ε* is an arbitrary positive constant, which is given at preprocessing time.
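The definition can be checked directly by brute force, which is handy for validating an approximate index against exhaustive search (a sketch; the function name is our own):

```python
import math

def is_eps_approx_nn(p, q, points, eps):
    """True if dist(p, q) <= (1 + eps) * dist(p_star, q), where
    p_star is the true nearest neighbor of q among points."""
    true_dist = min(math.dist(s, q) for s in points)
    return math.dist(p, q) <= (1.0 + eps) * true_dist
```

Note that for eps = 0 this reduces to the exact nearest neighbor condition.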

Arya *et al.* [12] proposed a hierarchical spatial subdivision data structure, called the *BBD-tree*. This structure has the nice features of having *O*(*n*) size, *O*(log *n*) depth, and each cell has bounded aspect ratio, that is, the ratio between its longest and shortest side is bounded. They achieved this by augmenting the axis-aligned splitting operation of the k-d tree (see Figure 63.14(a)) with an additional subdivision operation called *shrinking* (see Figure 63.14(b)). A shrinking node is associated with an axis-aligned rectangle, and the two children correspond to the portions of space lying inside and outside of this rectangle, respectively. The resulting cells are either axis-aligned hyperrectangles, or the set-theoretic difference of two axis-aligned hyperrectangles. They showed that, in all fixed dimensions *d* and for *ε* > 0, it is possible to answer *ε*-nearest neighbor queries in *O*(log *n*) time using the BBD-tree. The hidden asymptotic constants in query time grow as (1/*ε*)^*d*.

Duncan *et al.* [33] proposed an alternative structure, called the *BAR tree*, which achieves all of these combinatorial properties and has convex cells. The BAR tree achieves this by using cutting planes that are not necessarily axis-aligned. Clarkson [26] and Chan [22] presented data structures that achieved better *ε* dependency in the query time. In particular, they showed that queries could be answered in *O*((1/*ε*)^(*d*/2) log *n*) time.

**Approximate Voronoi Diagrams**

As mentioned in Section 63.4.1 it is possible to answer nearest neighbor queries by applying a point location query to the Voronoi diagram. However, this approach does not generalize well to higher dimensions, because of the rapid growth rate of the Voronoi diagram and the lack of good point location structures in dimension higher than two.

Har-Peled [47] proposed a method to overcome these problems. Given an error bound *ε* > 0, an *approximate Voronoi diagram* (AVD) of a point set *S* is defined to be a partition of space into cells, where each cell *c* is associated with a *representative* *r_c* ∈ *S*, such that *r_c* is an *ε*-nearest neighbor for all the points in *c* [47]. Arya and Malamatos [4] generalized this by allowing up to some given number *t* ≥ 1 representatives to be stored with each cell, subject to the requirement that for any point in the cell, one of these *t* representatives is an *ε*-nearest neighbor. Such a decomposition is called a (*t*, *ε*)*-AVD*. (See Figure 63.15.)

Of particular interest are AVDs that are constructed from hierarchical spatial decompositions, such as quadtrees and their variants, since such structures support fast point location in all dimensions. This yields a very simple method for performing approximate nearest neighbor searching. In particular, a tree descent determines the leaf cell containing the query point, and then the closest of the *t* representatives is reported.
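Such a descend-and-report query might look as follows (a sketch over a hypothetical quadtree-style node layout of our own devising; the (*t*, *ε*) guarantee comes entirely from how the cells and their representatives were constructed, which is not shown):

```python
import math

def avd_query(node, q):
    """Descend a quadtree-like AVD to the leaf cell containing q,
    then report the closest of its stored representatives.
    Hypothetical node format: leaves are ('leaf', reps); internal
    nodes are ('node', center, children), where children is a list
    indexed by the orthant of q relative to center."""
    while node[0] != 'leaf':
        _, center, children = node
        idx = sum(1 << i for i in range(len(q)) if q[i] >= center[i])
        node = children[idx]
    return min(node[1], key=lambda r: math.dist(r, q))
```

The query cost is just the depth of the descent plus *t* distance computations, which is what makes quadtree-based AVDs attractive in all dimensions.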

Har-Peled [47] showed that it is possible to construct a (1, *ε*)-AVD in which the number of leaf cells is *O*((*n*/*ε*^*d*)(log *n*) log(*n*/*ε*)). Arya and Malamatos [4] and later Arya, Malamatos, and Mount [8] improved these results by showing how to construct more space-efficient AVDs. In all constant dimensions *d*, their results yield a data structure of *O*(*n*) space (including the space for representatives) that can answer *ε*-nearest neighbor queries in *O*(log *n* + (1/*ε*)^((*d*−1)/2)) time. This is the best asymptotic result known for approximate nearest neighbor searching in fixed dimensional spaces.

**Sources and Related Material**

General information regarding the topics presented in the chapter can be found in standard texts on computational geometry, including those by Preparata and Shamos [64], Edelsbrunner [35], Mulmuley [61], de Berg *et al.* [31], and Boissonnat and Yvinec [21], as well as Samet's book on spatial data structures [66]. Further information on point location can be found in a survey paper written by Snoeyink [72]. For information on Voronoi diagrams see the book by Okabe, Boots and Sugihara [62] or surveys by Aurenhammer [13], Aurenhammer and Klein [14], and Fortune [38]. For further information on geometric graphs see the survey by O'Rourke and Toussaint [63]. Further information on nearest neighbor searching can be found in surveys by Böhm *et al.* [20], Indyk [49], and Chavez *et al.* [23].