J. Appl. Prob. 23,524528 (1986)

Printed in Israel

Applied Probability Trust 1986

BOUNDARY DOMINATION AND THE DISTRIBUTION OF

THE LARGEST NEARESTNEIGHBOR LINK IN HIGHER DIMENSIONS

J. MICHAEL STEELE,* Princeton University

LUKE TIERNEY,** University of Minnesota

Abstract

For a sample of points drawn uniformly from either the ddimensional torus or the dcube, d =~ 2. we give limiting distributions for the largest of the nearestneighbor links. For d ~_ 3 the behavior in the torus is proved to be different from the behavior in the cube. The results given also settle a conjecture of Henze (1982) and throw light on the choice of the cube or torus in some probabilistic models of computational complexity of geometrical algorithms.

COMPUTATIONAL GEOMETRY; SPIRAL SEARCH; EXTREMEVALUE DISTRIBUTION; GUMBEL DISTRIBUTION

1. Introduction

If pl, P2, . . .,p. are n given points of R", it is a basic problem of computational

geometry to determine the set of nearestneighbor linkages, i.e. to determine for

each pi which element Of {PI,P2, is nearest to pi.

The work done on this problem from the point of view of computational complexity is quite extensive, but the works of Friedman et al. (1975), Lee et al. (1976), Friedman et al. (1977), and Bentley et al. (1980) provide a tracing of the basic development of the areas in terms of averageease behavior. From the point of view of worstcase behavior, basic contributions are made in Shamos (1978), Lipton and Tarjan (1977), and Zoinowsky (1978).

The length of the largest of the nearestneighbor links is defined formally by

(1. 1) Z(p, p2,.. ., p.) = max min 11 pi pi ji

I;Sisn i:jpli

Received 19 July 1984; revision received 13 May 1985.

* Postal aodress: Department of Statistics, Princeton University, Fine Hall, Washington Road, Princeton, NJ 08544, USA.

** Postal address: School of Statistics, 270 Vincent Hall, 206 Church St SE, University of Minnesota, Minneapolis, MN 55455, USA.

Research supported in part by NSF Contract DMS8414069.

524

Boundary domination and the distribution of the largest nearest neighbor link 525

where lip q 11 denotes the usual Euclidean distance. This quantity comes up in almost all discussion of nearestneighbor computations although its appearance is not always explicit.

The object of principal interest here is the sequence of random variables Z. defined by

(1.2) Z. = Z(Xl, X2, ''', X.)

where the Xi are independent and uniformly distributed on either the dcube [0, 1f' or the dtorus obtained by identifying opposite faces of the dcube. This random variable is closely related to a similar weighted nearest neighbor variable which has been studied in Henze (1981), (1982), (1983).

There are three limit results which will be given and which provide an asymptotic understanding of Z.. Because of the proximity of Henze's work only the last limit result will be proved in detail. Theorem 2 answers the conjecture of Henze (1982), cf. p. 354, item 5.

One practical implication of Theorem 2 is that in the modeling applications of computational geometry one would be well advised in many cases to work in the dtorus as opposed to the dcube. To do otherwise, one risks having to be seriously concerned with counterintuitive boundary effects whenever d k 3. The value of this advice might be particularly felt by those who would attempt to appraise the computational complexity of a nearestneighbor procedure by means of simulation.

Theorem 1. For Xi, 1 i < , independent and uniformly distributed on the dtorus, one has

liM p(Z d > (t

(1.3) n + log n)lnWd) =1 exp( e')

Rd

where Wd is the volume of the unit sphere in d ~: 2.

For the dcube the boundary begins to play a role for d 3, as the following results illustrate.

Theorem 2. For Xi, 1 5 i < independent and uniform on [0,112, one has

liffl p(Z2

(1.4) . . = (t + log n)liTn) exp( e')

but for Xi, 1 :_5 i < , independent and uniform on [0, Ild, d 3, one has

liM p(Zd (t +

(1.5) n_. . log n)IWdn)

The weighted nearestneighbor random variables studied by Henze are given by

526 J. M. STEELE AND LUKE TIERNEY

Z.'=maxrnin ipiniixixiii,lixidsil

where dS is the boundary of the dcube. The proofs of (1.3) and (1.4) can be obtained by modification of the results of Henze (1982), and this modification will not be given here. The main goal is the proof of (1.5) which we give in Section 2.

Before quitting the introduction, it is worth noting that nearestneighbor statistics have recently been studied from a different point of view in Bickel and Breiman (1983) and Shilling (1983a, b). These authors provide much information about sums of functions of nearestneighbor link lengths and the application of such sums to the theory of goodnessoffit tests.

2. Boundary behavior

We now give the proof of the limit relation (1.5) for the interesting case of d _~ 3. We fix e > 0 and choose for each n a sequence of M(n) points yi on the onedimensional faces (i.e. the edges) of [0, fl` such that

(2.1) yi yl 2(1. + )z. for i3,~

where z','= (t + log n)lnwd, and

(2.2) all of the yi are at least a distance (1 + e)z, from the corners of the cube.

It is easy to check that we may choose M(n) such that M = M(n) a /Zn where a depends only on d.

We next let Q. be the event that the ball B (yi, ez. {x x yi cz.} contains exactly one of the points of {X, X2,. . .,X. 1 and the remainder of the ball B (yi, (1 + e )z, ) contains no further such points.

On setting D. = Ujm_i Q. we see that (1.5) follows at once if we show P(D.) 1. This will be done with help from the Poisson process.

We denote probabilities which are calculated with respect to a homogeneous Poisson process with rate n by a subscript ir, and we calculate

(2.3) P~ (D') = (1 P. (Q.)) M

and

P~ (Q.) = ne exp( n (1 + E)"COdZZ2')

(2.4) = ed (log n + t)2'd exp( 2 1d (1 + 6 )d (log n +, t)).

Since d 3, we now see that e > 0 can be, chosen such that (1 + E )d < 2 d~ '1d, so using the fact that M(n) awd' ld(log n )11d n 11d we obtain P, (D') 0.

i i

i

i

i i

Boundary domination and the distribution of the largest nearest neighbor link 527

It remains to show that this last relationship also holds under the uniform model.

For each 1 :5 i :5 M we let Ki and Li denote the number of sample points in B(yi,ezi) and B(Yi,(1+)zi)~B(yi,Ez,) respectively. The event D. depends only on these counts. We complete the proof of (1.5) by establishing the following result.

Lemma. For any event E. which depends only on {Ki, Li i 5 M} we

have P(E.) P~ (E~) 0 as n .

Proof of the lemma. We denote the probability mass function of Kt, K2,' ., Km, L,, L2, ' ',L,, by g. or by h. accordingly as one uses the Poisson or the uniform model. The likelihood ratio is given by

(2.5) & = h~ (K, K2,. Km, L I, L2, * * , Lm

g. (K,, K2, ' Km, L,, L2,' . ., Lm)

In order to show P(E.) P, (E~) 0, it will suffice to show that R. > 1 in probability under the uniform model. (For this reduction see Weiss (1969), pp. 261262, or Weiss (1965), pp. 219220.)

To write an explicit formula for the likelihood ratio R,, we introduce the following notations:

d Zd 1d

(2.6) p. = e "o),,z '2'", q~ = [(1 + e e 1 0)d n2 r. =p. + q.

and

M M

(2.7) U= U~ Ki, V = V~ Li, WW~ = U. + L~.

This notation permits us to write

(2.8) & = Wwn`0 MOw exp(nMr.)

where (n). denotes the falling factorial n(nl)(n2) ... (n s+l). The leading term (n)wn~w can be most easily estimated by first noting

W

(2.9) (n)wnw (1 k/n) _~ 1 W(W + 1)1(2n).

Under the uniform model W = W. is just a binomial random variable with sample size n and success probability M(1 + e)"o),,zn2'"P (log n MY""" for a constant One can thus easily check that W',In 0 in probability, so (n)"wn w 1 in probability.

One can similarly express the remaining factors of R. as

(2.10) (1Mr,)"wexp(nMr,)= exp(n(Mr. +log(l Mr,)) W1og(l Mr.).

i

1

i

i

i

528 J. M. STEELE AND LUKE TIERNEY

M2 2

Since Mr. 0 and n rn > 0, and W. log(l Mr.) 0 in probability we have

completed the proof that R,, 1 in probability. This was all we needed to

establish the main result expressed in Equation (1.5).

Acknowledgement

We should like to thank Jon Bentley for bringing this problem to our attention. We should also like to thank the referee for careful comments which have helped the exposition enormously.

References

BENTLEY, J. L. (1976) Divide and Conquer Algorithms for Closest Point Problems in Multidimensional Space. Unpublished Ph.D. Thesis, University of North. Carolina.

BENTLEY, J. L., WEIDE, B. W. AND YAo, A. G. (1980) Optimal expectedtime algorithms for closest point problems. ACM Trans. Math. Software 6, 563580.

BICKEL, P. J. AND BREIMAN, L. (1983) Sums of functions of nearest neighbor distances, moment bounds, limit theorems and a goodness of fit test. Ann. Prob. 11, 185214.

FRIEDMAN, J. H., BASKETT, F. AND SHUSTEK, L. J. (1975) An algorithm for finding nearest neighbors. 1EEE Trans. Computers 24, 10001006.

FRIEDMAN, J. H., BENTLEY, J. L. AND FINKEL, R. A. (1977) An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Software 3, 209226.

HENZE, N. (1981) Ein asymptotischer Satz uber den maximalen Minimalabstand von unabhAngigen Zufallsvektoren mit Anwendung auf einen Anpassungstest im RP und auf der Kugel. Unpublished Doctoral Dissertation, University of Hannover.

HENZE, N. (1982) The limit distribution for maxima of 'weighted' rth nearestneighbor distances. J. Appl. Prob. 19, 344354.

HENzE, N. (1983) Ein asymptotisher Satz uber den maximalen Minimlabstand von unabhfingigen Zufallsvektoren mit Anwendung auf einen Anpassungstest im RP und auf der Kugel. Metrika 30, 245259.

LEE, R. C. T., CHIN, 1. 1. AND CHANG, S. C. (1976) Application of principal component analysis to multikey searching. IEEE Trans. Software Engineering 2, 185193.

LIPTON, R. AND TARJAN, R. E. (1977) Applications of a planar separator theorem. 18 Symp. Foundations of Computer Science, IEEE, 162170.

PAPADIMITRIOU, C. H. AND BENTLEY, J. L. (1980) Worstcase analysis of nearest neighbor searching by projection. Technical Report CMUCS80109, Department of Computer Science, CarnegieMellon University.

SCHILLING, M. F. (1983a) Goodness of fit testing in R' based on the weighted empirical distribution of nearest neighbor statistics. Ann. Statist. 11, 1~12.

SCHILLING, M. F. (1983b) An infinitedimensional approximation for nearest neighbor goodness of fit tests. Ann. Statist. 11, 1324.

SHAMOS, M. 1. (1978) Computational Geometry. Unpublished Ph.D. Thesis, Yale University.

WEIDE, B. W. (1978) Statistical Methods in Algorithm Design and Analysis. Ph.D. Thesis, CarnegieMellon University.

WEiss, L. (1965) On asymptotic sampling theory for distributions approaching the uniform distribution. Z Wahrscheinlichkeitsth. 4, 217221.

WEiss, L. (1969) The asymptotic joint distribution of an increasing number of sample quantiles. Ann. Inst. Statist. Math. 21, 257263.

YLIVAL, G. (1976) Finding nearest neighbors. Inf Process. Lett. 5, 6365.

Z01NOWSKY, J. E. (1978) Topics in Computational Geometry. Ph.D. Thesis, Stanford University.