i

I

I

I

i i

414 J. L. BENTLEY It al.

Delete, Predecessor, Successor, Maximum, and Minimum, The basic data structure that we use to represent such a set Of size at most N is a sorted linked list implemented by the two arrays Link [0 ... NJ and Value [I ... NJ. The value in Link [0] Points to the first element in the list, Link [Link [0]] points to the next element, etc. We call such a data structure a J-1ist after Janko [J I, J2] who first studied the structure in a randomized setting. Among other things, we will show that Member, Insert, Predecessor, Successor, and Maximum can all be accomplished in 2,1N_ - c expected steps where c is a small constant, and that this bound is optimal under a plausible model of constraints imposed by the data structure. We also show that Delete requires just 4.,FN - 2c expected steps and that M inimum requiresjust one step (Link [0] points to the minimum element). All of these bounds (except for Minimum) are dramatically better than the worst case bound of N steps.

Although quite simple, the J-list data structure is surprisingly efficient. In fact, it is superior to all of those described by Knuth [K] for certain applications. The salient attributes of such applications are as follows:

• Space is important. This structure uses only one extra word of storage per element, while binary search trees use at least two extra words, and various hashing schemes use varying amounts of extra storage. However, the storage for this structure must be available in a single contiguous block.

• The "orderedness" operations of Successor, Predecessor, Minimum, and Maximum are frequent; these are not possible in most hashing schemes.

• Insertions and deletio' ns are frequent. If the data structure

changes rarely, binary search in a sorted array is very efficient.

• Program simplicity is important. Each operation on this structure requires only about a dozen lines of code, while some operations on balanced binary search trees require over 100 lines of code.

• Run time is important for problems of medium size (where medium means that N is between, say, 100 and I o,000). If N is below that range, simple sequential strategies are probably efficient enough. If Nis above that range, then the logarithmic search time of binary search will be necessary for many applications. When Nis in the medium range, though, the low constant factors of this structure will make it competitive with binary search trees.

A Randomized Data Strudure for Ordered Sets 415

Of course, the simple linked list is one of the most basic and wellknown data structures, and has arisen in countless contexts. Most relevant to this chapter is the prior work of Janko [J 1, J21 who studied randomized algorithms for sorting using linked lists and obtained an O(N 112 ) bound on the expected time needed to sort N items.

The remainder of the chapter is divided into sections as follows. In Section 2, we define the problem more precisely and observe that the worst-case complexity of performing a Member Search is linear in N. For the most part, we concentrate our efforts on the analysis of a simple algorithm for Member Search. This is because the algorithm for Member Search can be easily transformed into an efficient algorithm for each of the other operations. In Section 3, we describe the Mist structure and explain its relationship to a simple Guess-Decrement game. We also describe an optimal randomized algorithm for Member Search and show how it can be extended to form efficient algorithms for Insert, Delete, and the other operations. Section 4 considers some natural extensions of the basic model and contains some additional probabilistic analysis.

2. THE PROBLEM AND ITS DETERMINISTIC

COMPLEXITY

The Mist is implemented in contiguous storage by the two arrays Link [0 ... NJ and Value [I ... N ]. The pointer Link [0] points to the first element of the list, Value [Link [01]. The next element can be found in Value [Link [Link [0]]], and so forth. The end of the list is denoted by an element whose Link field contains - 1. Furthermore, we will insist that the array is dense: Value [I ... NJ must contain N elements of the represented set. The sortedness of the linked list implies that if Link [I] is not - 1, then Value [11 < Value [Link [I]] We will often refer to ValueV] and Link[I] together as node I. Figure I illustrates the array representation of the sorted linked list <2.6,3.1,4.1,5.3,5.8, 5.9,9.7>.

It is clear that performing a Member Search in such an array requires accessing at most N elements of the array (either by following Link fields through the list or simply by iterating through Value fields of the array). We will now show that in the worst case, this

Figure 1. An array representation of the sorted linked list <2.6,3.1, 4.1, 5.3, 5.8, 5.9, 9.7 >.

1 0 1 2 3 4 5 6 7

value(l] 1 3.1 ;.31 51 1 9-Ig

1

Link[l) 5 1 7 1 1 1 -_fi 3 1 -1--

416 J. L. BENTLEY et 4

much time is necessary to decide whether a given element is in the list. We will assume that a (deterministic) search algorithm is composed of operations of the following types, each with unit cost.

I . Determine the index of the node at the head of the list (by

accessing Link [0]). There is one operation of this type.

2. Determine the successor of node I for I < I < N (by access

ing Link [I]). There are N operations of this type.

3. Determine the Value of node I, for I < I < N (by accessing

Value [I]). There are N operations of this type.

(Note that if operations of type 2 have no cost; then binary search can be used to solve the problem in logarithmic time.)

Our model assumes that a protagonist specifies a sequence of the above operations whiie an adversary ensures that N operations will be required. We will assume that the adversary knows the value of the key the protagonist seeks, which we will call V, and that other key values may be assigned arbitrarily by the adversary. We will describe a strategy that enables the Adversary to delay returning V until the protagonist has specified a sequence of N operations. Without loss of generality, we will assume that whenever one of Value [11 or Link [I] is asked, both of Value [1] and Link [11 are provided at a cost of a single step (I <, I < N). The value of V will be the maximum element in the list. There are two cases depending on whether or not the protagonist asks the type I question.

Case 1: The type I question is not asked.

The adversary always answers questions so that the protagonist has queried a contiguous subset of the ordered list. In particular, assume that the protagonist asks about node I where I has not yet been queried. (Remember that Value [I] and Link (I] are always provided together, at a total cost of one.) If I = Link [J] where J is the node at the head of the continuous subset of previously queried nodes, then the adversary assigns the largest value yet given (but less than V. of course) to Value [I] and assigns an as yet unqueried node to Link (I]. If 10 Link [J], then the adversary assigns the smallest value yet given to Value [I] and sets Link [I] = K where K is the smallest node in the contiguous subset of previously queried nodes. In either case, the set of queried nodes continues to form a contiguous subset of ordered list, with all values less than V.

i

418 J. L. BENTLEY et al.

3.1. A Class of Randomized Algorithms for Member Search

By combining probabilistic access with access by predecessor link, a wide range of algorithms can be considered. For example, the following pseudo-Pascal program searches for the element E, using the order of operations specified by the array Step. When Step [J] is zero, a random sample occurs. Otherwise the program follows the next link in the list. Note that when a random sample is chosen, the position in the list is updated only if the random position is closer to (but not at or beyond) the location of E. This strategy ensures that the updated position in the list never worsens and that when E is eventually found, its predecessor will also have been found (since E will have been reached via a link). (This particular code assumes that Value [0] is - oo and Value[- 1] is 00.)

P := 0 J:= 0 do until exit J:= J+ I if Step [J] = 0 then do R:= Random(I,N) ifValuer ]

For any specified Step array, the expected performance of the associated algorithrn will depend on the value of E being searched. For example, if E is less than or equal to the smallest item in the list, then the algorithm will terminate on or before the first Link access. For the time being, we will focus on the more interesting case when E is bigger than the largest item in the list. This is, in fact, the Worst case for any step array in terms of expected running time, and is representative of the case when E has a random rank. For expediency, we will defer the proof of these assertions to Section 4, where we consider search values with an arbitrary index. We also consider more sophisticated algorithms in Section 4, including procedures that decide whether to step forward or move randomly based on whether or not previous random moves were successful,

420 J. L. BENTLEY et al.

Note that E(u) also denotes the expected running time of the Member Search algorithm for the corresponding Step array. Hence, determining the optimal a is equivalent to determining the optimal algorithm from the class described in Section 3.1. For simplicity, we will use the G-D notation henceforth.

3.3. An Optimal Strategy

The task of finding an optimal strategy for the G-D game will proceed in two steps. The first step consists of finding the best strategy from among those of the form G k D' for some k, The second, and more difficult, step consists of showing that there is an optimal strategy having this form.

We will start by analyzing strategies of the form G k DI. Within this restricted class, it is easy to determine the best values of k and the minimum of E(G'D').

THEOREM 1. Let k(N) be the value of k that minimizes E(G'D'). Then k(N) = ~,N_ - I - l/(24,vTN_) + 0(11N).

Proof. From the definition,

k+N

E(GkD N) E Qj(G k D N).

j=0

Since the first k operations are Guesses the game cannot end there,

so Qj(GkD N) = I for j = 0, . . . , k. The probability of not terminat

ing during the first d Decrements is (N - d)k N -k' since all the

Guesses must be larger than d in order not to terminate. Therefore

N-I

E(GkD N) = k+ 1+ Z (N-d )k N -k

d=1

N-1

= k+ I +N -k E d k

d=1

= k + I + NI(k + 1) - 1/2+ kl(12N) + 0(k'IN 2).

The minimum occurs when

I - NI(k + 1)2 + 11(12N) + 0(k1N 2) = 0

422

J. L. BENTLEY

LEMMA2. For any sequence co, Pj(o)D)/PO(o)D),

Proof. Each ratio is a product of terms of the form

[(N -j - hi - ... - bX(N - bi - - - - - h,)]ai. The sequence (oD has one more D in the last block than o-), so b. increases by one in wD. When Po (ojD) > 0, a comparison of these terms, letting K = N - hi - - - - - b,, reveals that

[(K - j - I)I(K - l)1a' < [(K - j)1K]aj

and thus that Pj (o)D)/Po ((oD) < Pj ((o)lpo (co). If Po (co]D) = 0, we

define the ratio to be zero and the inequality still holds. 0 1

Remember that our goal is to prove that the optimal strategy has the form GkD'. To do this, we next analyze the effect of minor variations in strategy on the expected number of operations. Then we will show that if a small variation improves the strategy, then a larger change could mean even more improvement. The two sequences which we will compare first are a = o)DG'D' and U* = (oGkD'. The only difference between a and u* is the position of the block G'. The following lemma gives a method for comparing these two strings.

k N) k

LEMMA 3. E((oDG D < E(o)G D') if and only if Vk ((o) < 1, where

N-1

P I (CO)/PO ((0)

+ Y_ Pd ((O)/Pb ((0)

d=1

Vk ((0) x [(N- d + I ) k (N - d)k ]k-'N -k if po((O) > 0

0 if PO(CO) = 0.

Proof. Let a and a* be defined as above. We would like to know

when E(a) - E(a*) < 0. The following values for Q, (u) and Q,,,,(u*) can be easily verified.

if M < JO)j

P, ((0) if 1(01 + I <_ M

Q. (a) + k + I

PM._jcoj-k((O)

1)k -k

x (N-m + 1(ol +k+ N if m > 1(t)l + k + I

I

A Randomized Data Structure for Ordered Sets 423

Q. (a) if m '< 10)l

NO)) if Icul + I '< M

Q. (a*) = < 1(ol + k

P,n-jcoj-k((0)

I X (N - m + jo)l + k)k N-' if m > 1col + k.

Thus E(a) - E(u*) [Q.(u) - Q,(a*)] <, 0 is equivalent to

N-1

(k + I)PI (w) - kP0 (w) + E Pd (w) [(N - d + I)k

d=1

)k -k pjo)) <

-(N-d IN '0.

Rearranging terms slightly gives

N-1

P 1 (0)) + Y, Pd (00 [(N - d + 1)k (N - d )k I N -k k-' < Pjw).

d=1

El

Combining Lemmas 2 and 3 enables us to extend a minor variation of the string into a more radical change. Specifically, if the last block of G's is more efficient when it is moved to the right one place, then it is best to remove the block of Guesses altogether.

UmmAA Fora1l(o,ifE(o,)DGkD N) < E(o-)GkDv), then E((OD'v) <, E((oDG k D').

Proof. If E(coDiG k D') E((oDj-'G k D') for some j then Vk (coDi -') < I by Lemma 3. By Lemma 2 and the definition of Vk (co), we know that Vk (col)j) < Vk ((oDi '), and thus that Vk(a)Dj) < 1. Thus, we can conclude by Lemma 3 that E((oDi " G k D') < E((oD I . G k D'). The proof of the lemma is completed by applying this process inductively.

It is now a simple matter to prove our main result.

THEOREm2. For every starting sequence co, there exists an integer r >, 0 such that E(coGrD N) < E(wa) for every completed sequence Wa.

Proof. Let a denote the shortest (in length) sequence for which

N eja is complete and E((ocr) = min, E(coT). If a is of the form UD

424 J. L. BENTLEY et al.

then we are done. Otherwise or = o)*DG'Di, where G k is the last block of Guesses and j > 0. By the optimality of a, we know that E(w(o*DG~Dj) < E(cow*G k Di "). Thus by Lemma 4, we know that E((o(o*Dj`) < E((oco*DG`Dj) which contradicts the minimality of u.

Thus the best way to finish any initial sequence is by a block of Guesses followed by Decrements. By letting (o be the empty string, we find that the optimal strategy for the game is G'D'. Recalling Theorem 1, we find that the optimal value is near -,IN-- - I 11(24,,fN-) + 0(11N) and thus S(N) = 2,,FN- - c where c is a constant that tends to 1/2 as N gets large. Hence the expected number of steps for Member Search is at most 2,1N--. As a consequence, it is not difficult to show that this is within one or two steps of optimal whenever we are searching for an item that is bigger than the median. Searches for items less than the maximum are discussed more thoroughly in Section 4.

3.4. Algorithms for the Other Operations

Thus far we have considered only the problem of searching the linked list to determine if it contains a given element. It is easy to perform many other set operations on this structure. The following list summarizes those operations, and describes their costs in terms of the number of Value elements accessed.

• Member: The previous sections studied the problem of searching to determine whether a given element is a member of the set represented by the linked list. Cost: 2,JN-.

• Insert: A new element can be inserted in the list by using

Member Search. Cost: 2-,,IN--.

• Delete: The first step in deleting an element is to find that element by a search algorithm, and then modify the Link field of its predecessor to point to its successor. This takes 2-,,IN-references to the Value array. The next step must patch the "hole" created in the dense array by moving the last element of the array to the vacant position. Searching for the last element requires 2.,FN references. Cost: 4-,fN--

Predecessor: The element immediately preceding a given element can be found by a simple modification to the Member Search algorithm. Cost: 2-,fN-.

A Randomized Data Structure for Ordered Sets 425

• Succcessor: The element immediately succeeding a given element can be found by a simple modification to the Member Search algorithm. Cost: 2,12Y.

• Minimum: The minimum element in the set is pointed to by

Link [0]. Cost: 1.

• Maximum: The maximum element in the set can be found by

searching for infinity. Cost: 2-,JN-.

Each of the above operations is straightforward to implement given the Member Search algorithm and basic techniques for dealing with data structures for searching (described, for example, by Knuth [K]). Furthermore, the simplicity of the algorithm implies that the constant factors in the running time of the program will be relatively small. The only deviation that a programmer should make from the Member Search algorithm deals with the random number generation: since some random number generators are very slow, it might be preferable to use some other approach to sample the k elements.

4. EXTENSIONS AND REMARKS

At first glance, it might appear that our proof technique depends on the fact that Guesses do not yield zero. This is not the case. When a Guess returns an integer uniform on [0... N], the optimal strategy differs from the above strategy by only 0(l1.,/N_) Guesses.

Our analysis of the G-D game was for the worst case task of searching for the final element in the list. We will now consider searches for a random element. The value i can be interpreted as the number of links between our present position in the list and the position of the element we are searching for. When searching for the last element we start at i = N. Suppose a Member search seeks an element whose position is unknown and randomly distributed. We should then start at an unknown, random i, or equivalently start at i = N and do one Guess to randomize i before counting operations. By Theorem 2 the optimum strategy is then G"-'D' as opposed to GD' when searching for the last element.

Sometimes it is useful to find thejth element givenj. Whenj is small it is easy to follow links and whenj = N we can use the G-D strategy. Between these two extremes, the strategies used thus far are not necessarily valid since the value of the j th element is

426

J. L. BENTLEY et al.

Figure2. Graph of e(j) and S(Nj)l,,I-N- where S(Nj) is the optimal expected number of operations when searching for the jth element.

2

I

sm.i)A-

1 2 2 3 4VIR

unknown. But if we know the value as well as the position of the element for which we are searching, then we can apply the above techniques to find an optimal G-D search time. Searching for an element at position i means starting the G-D game at i = j. This is equivalent to starting at i = N and doing N -j Decrements. By Theorem 2 the best way to continue from this point is G'D' for some k. Thus it is only necessary to compute the optimal number of Guesses, k = r(j). By modifying Theorem I it can be shown that r(j) = e(j),FN where

0

CW =

if j < -~[2-N

(j- l)-',1N--{e(j-')cU),~N

X [I _ C(j)21 _ 11 _ 0(11.,[N) if ,12N < j '< 0 (-,IN--)

- 0(11,.,IN--) if j > 0(~,IN--).

It is easy to determine c(j) numerically and a graph of the function (e.g., see Figure 2) shows that (once nonzero) c(j) approaches I exponentially fast.

The G-D game can be modified in other ways. One particularly interesting modification allows the player to use the information about when a random sample is successful (i.e., closer to the target). For example, suppose the value i is contained in a black box that is connected to a light that turns on every time i is decreased. At first glance, it appears as though such information could be quite useful in planning when to stop Guessing and start Decrementing. For instance, if the light flashed on for a series of early Guesses, then the player might be led to believe that the early Guesses were very good and thus that i had become very small. Hence, the player might

I

V

i

I

I

i

i

A Randomized Data Structure for Ordered Sets 427

think it wise to start Decrementing early. This is not the case, however, since if all the Guesses are required to be distinct, then it can be shown that the light does not add any useful information at all. It is worth remarking that the likelihood of two Guesses being identical is small and thus the constraint that all the Guesses be different has a negligible effect on the final result.

Such a counterintuitive result requires some justification. First notice that when all the Guesses are distinct, the sequence of Guesses is just a permutation of a subset R of 11,. . . , N 1. Every sequence of Guesses produces a unique light sequence, P, but one light sequence can be produced by many different Guess sequences. In particular:

LEMMA 5. For every light sequence P of length k there exists an integer m, such that for any set of Guesses R 9; 11, . . . , N I of size k there are exactly m permutations of R that have light sequence P.

Proof. Let K = {1, . . . , kJ and set m to be the number of permutations of K that produces the light sequence fl. Now consider any other subset R of length k. There is a I - I order-preserving mapping between K and R, so there is also a I - I mapping between the permutations of the two sets that preserves light sequences. This means that there are also m permutations of R that fit

We can now prove

THEOREM 3. When all the Guesses in the G-D game are distinct, then the light sequence adds no extra information about the value of i after a sequence of Guesses.

Proof. It is sufficient to show that the probability Pr [i = j I #I that i = j after k guesses given a light sequence fl, is the same as the probability Pr [i = j] that i = j after k guesses (with no knowledge of the light sequence). From the definitions and Lemma 5,

Pr [i = j I fil = # of sequences of Guesses that fit P for which i = j)l # of sequences of Guesses that fit P)

= (# of R for which i = j)ml(# of R)m

= (# of R for which i = j)k!l(# of R)k!

= (# of sequences of Guesses for which i = j)l

(# of sequences of Guesses)

= Pr[i = A. El

428 J. L. BENTLEY et al.

The G-D game might also be played with two different operations

0 , and 02 - It would be interesting to know what properties of

0, and 02 give an optimal sequence of the form ok 0,, for some

1 2

k and m. In addition to operators that act directly on i, we might

also consider comparison operations that compare i to a given

input n, and answer the question, "i < n?" If the compare operation

i < ,f2-N is added to G-D, then the optimal strategy is O(N 114)

Guesses followed by a Compare, repeated until i <,,~[2_N, ending

with Decrements. This strategy uses V_2N + O(N 114 ) expected steps.

ACKNOWLEDGMENTS

We would like to thank Gary Miller, Ron Rivest, Jim Saxe, and Mike

Sipser for helpful discussions.

This research was supported in part by ONR Contract N00014-76C-0370, NSF Grant MCS-78-07736, and an NSF Presidential Young Investigator Award with matching funds from Xerox and IBM. Parts of this chapter were presented at the 19th and 20th Annual Allerton Conferences on Communication, Control and Computing [BSS, LLI.

REFERENCES

[AHU] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.

[BSS] J. L. Bentley, D. F. Stanat, and J. M. Steele, "Analysis of a randomized data structure for representing ordered sets," Proc. 19th Annu. Allerton Conference on Communication, Control, and Computing 364-372 (1981).

[JI] W. Janko, "A list insertion sort for keys with arbitrary key distribution," ACM Transact. Math. Software 2: 143-153 (1976).

[J21 W. Janko, "An insertion sort for uniformly distributed keys based on stopping theory," Int. Comp. Symp. April 1977, pp. 373-379. North-Holland Publishing, Amsterdam, 1977.

[K] D. E. Knuth, The Art of Computer Programming. Vol. 3: Sorting and Searching. Addison-Wesley, Reading, MA, 1973.

[LL] T. Leighton and M. Lepley, "Probabilistic searching in sorted linked lists,"

I-- ;_ i., r-rol and Com-