4

A RANDOMIZED DATA STRUCTURE FOR ORDERED SETS

jon L. Bentley, F. Thornson Leighton. Margaret Lepleyl Donald F. Stanat, and J. Michael Steele

ABSTRACT

In this chapter, we consider a simple randomized data structure for representing ordered sets, and give a precise combinatorial analysis of the time required to perform various operations. In addition to a practical data structure, this work provides new and nontrivial probabilistic lower bounds and an instance of a practical problem whose randomized complexity is provably less than its deterministic complexity.

1. INTRODUCTION

In this chapter, we consider the problem of maintaining a set from a totally ordered domain under the operations Member, Insert,

Advances in Computing Research, Volume 5, pages 413428. Copyright 0 1989 by JAI Press Inc. All rights of reproduction in any form reserved.

414 J. L. BENTLEY et al.

Delete, Predecessor, Successor, Maximum, and Minimum. The basic data structure that we use to represent such a set of size at most N is a sorted linked list implemented by the two arrays Link [0 ... NI and Value [1 ... NI. The value in Link [0] points to the first element in the list, Link [Link [Ofi points to the next element, etc. We call such a data structure a J1ist after Janko [J1, J2] who first studied the structure in a randomized setting. Among other things, we will show that Member, Insert, Predecessor, Successor, and Maximum can all be accomplished in 2,,1N .c expected steps where c is a small constant, and that this bound is optimal under a plausible model of constraints imposed by the data structure. We also show that Delete requires just 4.,fN  2c expected steps and that M i n i m u m requires just one step (Li n k [01 points to the minimum element). All of these bounds (except for Minimum) are dramatically better than the worst case bound of N steps.
Although quite simple, the J1ist data structure is surprisingly efficient. In fact, it is superior to all of those described by Knuth [K] for certain applications. The salient attributes of such applications are as follows:

• Space is important. This structure uses only one extra word of storage per element, while binary search trees use at least two extra words, and various hashing schemes use varying amounts of extra storage. However, the storage for this structure must be available in a single contiguous block.
• The "orderedness" operations of Successor, Predecessor, Minimum, and Maximum are frequent; these are not possible in most hashing schemes.
Insertions and deletions are frequent. If the data structure changes rarely, binary search in a sorted array is very efficient. Program simplicity is important. Each operation on this structure requires only about a dozen lines of code, while some operations on balanced binary search trees require over 100 lines of code.
Run time is important for problems of medium size (where medium means that N is between, say, 100 and 10,000). If N is below that range, simple sequential strategies are probably efficient enough. If N is above that range, then the logarithmic search time of binary search will be necessary for many applications. When N is in the medium range, though, the low constant factors of this structure will make it competitive with binary search trees.

A Randomized Dafa S"cfure for Ordered Sefs 415

Of course, the simple linked list is one of the most basic and wellknown data structures, and has arisen in countless contexts. Most relevant to this chapter is the prior work of Janko [J1, J2] who studied randomized algorithms for sorting using linked lists and obtained an O(N 112 ) bound on the expected time needed to sort N items.
The remainder of the chapter is divided into sections as follows. In Section 2, we define the problem more precisely and observe that the worstcase complexity of performing a Member Search is linear in N. For the most part, we concentrate our efforts on the analysis of a simple algorithm for Member Search. This is because the algorithm for Member Search can be easily transformed into an efficient algorithm for each of the other operations. In Section 3, we describe the Mist structure and explain its relationship to a simple GuessDecrement game. We also describe an optimal randomized algorithm for Member Search and show how it can be extended to form efficient algorithms for Insert, Delete, and the other oper.ations. Section 4 considers some natural extensions of the basic model and contains some additional probabilistic analysis.

2. THE PROBLEM AND ITS DETERMINISTIC
COMPLEXITY

The Mist is implemented in contiguous storage by the two arrays Link[O ... NI and Value [1 ... NI. The pointer Li n k [01 points to the first element of the list, Value [Link [011. The next element can be found in Value [Link [Link [0]11, and so forth. The end of the list is denoted by an element whose Link field contains  1. Furthermore, we will insist that the array is dense: Va 1 u e [ 1 ... NI must contain N elements of the represented set. The sortedness of the linked list implies that if Link [I] is not  1, then Value [I] < Value [Link [I]]. We will often refer to Value [I] and Link [I] together as node I. Figure 1 illustrates the array representation of the sorted linked list <2.6,3.1,4.1,5.3,5.8, 5.9,9.7>.
It is clear that performing a Member Search in such an array requires accessing at most N elements of the array (either by following Link fields through the list or simply by iterating through Value fields of the array). We will now show that in the worst case, this

Figure 1. An array representation of the sorted linked list ~2.6,3.1, 4.1, 5.3, 5.8, 5.9, 9.7 ~.

1 0 1 2 3 4 5 6 7
Value[II  3. 1 5.3 1 5.8 9.7 1
LinKE13 A 1 5 1 7 1 1 1 6 . 1 3 11

416 J. L. BENTLEY et aL

much time is necessary to decide whether a given element is in the list. We will assume that a (deterministic) search algorithm is composed of operations of the following types, each with unit cost.

1. Determine the index of the node at the head of the list (by accessing Link [01). There is one operation of this type.
2. Determine the successor of node I for 1 ,< I < N (by accessing Link (I]). There are N operations of this type.
3. Determine the Va 1 ue of node I, for 1 < I <.N (by accessing Va 1 u e [I]). There are N operations of this type.

(Note that if operations of type 2 have no cost; then binary search can be used to solve the problem in logarithmic time.)
Our model assumes that a protagonist specifies a sequence of the above operations while an adversary ensures that N operations will be required. We will assume that the adversary knows the value of the key the protagonist seeks, which we will call V, and that other key values may be assigned arbitrarily by the adversary. We will describe a strategy that enables the adversary to delay returning V until the protagonist has specified a sequence of N operations. Without loss of generality, we will assume that whenever one of Value[I] or Link[l] is asked, both of Value[I1 and Link[I1 are provided at a cost of a single step (1 < I < N). The value of V will be the maximum element in the list. There are two cases depending on whether or not the protagonist asks the type 1 question.

Case 1: The type 1 question is not asked.

A Randomized Data Structure for Ordered Sets 417

The argument continues in this fashion until N  1 nodes have been queried. At this point, the remaining unqueried node is I = Link [i] where J is the largest queried node. In order to resolve Member Search for V, the protagonist must still ask the value of node I, making for a total of N queries overall.

Case 2: The type 1 question is asked.

3. RANDOMIZED ALGORITHMS

In what follows, we focus on algorithms that allow probabilistic access to nodes in addition to deterministic and Link access. Although the worst case performance of such randomized algorithms is no different than that for deterministic algorithms, we will find that the average case performance is much better.
The section is divided into subsections as follows. In Section 3. 1, we define a class of simple randomized algorithms for Member Search. We model the performance of algorithms in this class with a GuessDecrement game in Section 3.2. In Section 3.3, wetse the game model to show that the expected running time for the optimal Member Search Algorithm is 2,fN  c, where c is a small constant. We extend the algorithm for Member Search to other operations in Section 3.4.

418 J. L. BENTLEY et al.

3.1. A Class of Randomized Algorithms for Member Search

By combining probabilistic access with access by predecessor link, a wide range of algorithms can be considered. For example, the following pseudoPascal program searches for the element E, using the order of operations specified by the array Step. When Step [J] is zero, a random sample occurs. Otherwise the program follows the next link in the list. Note that when a random sample is chosen, the position in the list is updated only if the random position is closer to (but not at or beyond) the location of E. This strategy ensures that the updated position in the list never worsens and that when E is eventually found, its predecessor will also have been found (since E will have been reached via a link). (This particular code assumes that Va 1 u e [01 is  oc) and Va 1 u e 11 is 00)

p = 0 J= 0 do until exit J= J+ 1 if Step [J] = 0 then do R= Random(l, N) if Value [R] < E and Value [R] > Value [P] then.P R else do if Value [Link [P]] = E then exit (*E is at link [P]*) if Value [Link [P]] >.E then exit (*E is not in the list*) if Value[Link[P11 <.E then.P:= Link[P]

For any specified Step array, the expected performance of the associated algorithm will depend on the value of E being searched. For example, if E is less than or equal to the smallest item in the list, then the algorithm will terminate on or before the first Link access. For the time being, we will focus on the more interesting case when E is bigger than the largest item in the list. This is, in fact, the worst case for any step array in terms of expected running time, and is representative of the case when E has a random rank. For expediency, we will defer the proof of these assertions to Section 4, where we consider search values with an arbitrary index. We also consider more sophisticated algorithms in Section 4, including procedures that decide whether to step forward or move randomly based on whether or not previous random moves were successful.

A Randomized Data Sfrudure for Ordered Sets 419

3.2. Modeling Algorithms as Strategies

Before proceeding to construct an optimal algorithm from the class described in Section 3. 1, it is useful to associate algorithms in the class with strategies for a simple probabilistic "GuessingDecrement" game. The GD Game involves two integers, i and N. The value of N remains fixed throughout the game, and the value of i is originally N. The goal of the player is to reduce the value of i to zero in the minimum expected number of steps. A step consists of performing one of the following two operations:

• D (for Decrement): If i > 0, then replace i by i  1.
• G (for Guess): Choose j to be an integer uniform from 1 ... N and replace i by j if j < i. The value of i is unknown to the player, except at the beginning of the game when i = N, and at the end when he is notified that i has reached zero.

The value of i represents the distance from the current element in the linked listdenoted by P in the above algorithmto the end of the list. The value N is the number of items in the list. We start with i = N because we assume that we are searching for the largest element in the list. In the general case, we would start with i equal to the rank of E, if known.
Each Guess corresponds to a random access in the above code, whereas each Decrement corresponds to a link access. A strategy or sequence of operations will be denoted by a character string a composed of G's and D's to be performed in order from left to right. A sequence of G's and D's corresponds naturally to a Step array. A sequence is said to be complete if it contains at least N D's. Note that operations written after the first N D's are superfluous and need not appear. For convenience, however, we will often end
N
complete sequences with D
A complete sequence will always reach i = 0 and terminate the game after some number of steps t. The expected termination point for a complete sequence a is denoted by

w .0
E(a) E j Pr [t = i 1 E Qi (U)
j=1 j=0

where Qj (a) = Pr [t > j ]. The object is to minimize E(a) over all complete sequences a. We denote the minimum by S(N).

420 J. L. BENTLEY et al.

Note that E(u) also denotes the expected running time of the Member Search algorithm for the corresponding Step array. Hence, determining the optimal a is equivalent to determining the optimal algorithm from the class described in Section 3.1. For simplicity, we will use the GD notation henceforth.

3.3. An Optimal Strategy

The task of finding an optimal strategy for the GD game will proceed in two steps. The first step consists of finding the best strategy from among those of the form G'D' for some k. The second, and more difficult, step consists of showing that there is an optimal strategy having this form.
We will start by analyzing strategies of the form G'D'. Within this restricted class, it is easy to determine the best values of k and the minimum of E(G k D).

THEOREM 1. Let k(N) be the value of k that minimizes E(GkD'). Then k(N) = ..,1N  1  11(24~,IN) + O(IIN).

Proof. From the definition,

k+N
E(G k D N) = 1 Qj(G k D N).
j=0

Since the first k operations are Guesses the game cannot end there,
so Qi (G k D N) = 1 for j = 0, . . . , k. The probability of not terminat
ing during the first d Decrements is (N  d)k N k, since all the
Guesses must be larger than d in order not to terminate. Therefore

N1
E(G k D N) k + 1 + 5~ (N  d)k N k
d=l

N1
k + 1 + N' 1: d k
d=l

k + 1 + N1(k + 1)  1/2 + kl(l 2N) + 0(k2IN2).

The minimum occurs when

1  N1(k + 1)2 + 11(12N) + 0(kIN') = 0

1

A Randomized Data Sfructure for Ordered Sefs 421

and thus when

k = ,IN  1  1/(24~,/N) + O(IIN).

Theorem 1 provides an upper bound of S(N) < 2~N  1/2 + 11(12,jWN) + O(IIN) expected operations for the GD game. Of course, the optimal value of k = ,,IN  1  1/(24~,/N) + O(IIN) may have to be rounded to a neighboring integer, so we should conclude only that S(N) <, 2,~lN  c where c is a small constant that tends to 1/2 as N grows large. In what follows, we will show that this bound is tight by proving that there is an optimal strategy of the form G k D'. We commence with some definitions and lemmas.
When a sequence co is not complete, the value of i after the operations in w have been performed may remain undetermined. Instead of knowing the exact value of i we define a probability vector

Pj (co) = Pr [i > j after executing the sequence w]

We can see from the definition that Pj (co) >, Pj (co). Moreover the vector can be computed for any sequence (A).

a b, b,
LEMMA 1. For any sequence w = G 'D ... G'D ' with b
b, + b, and a = a, + + a,

(N j  b) a, (N j  b + b, )a,

Pj ((J)) (N j  b,)asN a forj < N b
10 for j >, N  b.

Proof. During each block of Guesses, G', the value i must remain abovej plus the number of D's that are still to be performed, b,,, + . .. + b, The probability that all the Guesses in the block are betweenj + bm + + b, + 1 and N is (N j  bn bs)Na.

These probabilities are important in determining the optimal strategy. The following lemma states one of the most useful properties of this vector.

422

J. L. BENTLEY et al.

LEMMA 2. For any sequence co, Pi (coD)/P, (coD) < Pj (a))1P0 (co) for 0
Proof. Each ratio is a product of terms of the form

[(N j  bi     bjI(N  bi

...  b,)],2i.

The sequence wD has one more D in the last block than co, so b, increases by one in coD. When PO (coD) > 0, a comparison of these terms, letting K = N  bi  . ..  b, reveals that *

[(K  j  1)l(K  1)]a' < [(K  j)1K]`

and thus that Pj(coD)/Po(coD) < Pj(w)/Po(w). If Po(coD) = 0, we
define the ratio to be zero and the inequality still holds. n

Remember that our goal is to prove that the optimal strategy has the form G'D'. To do this, we next analyze the effect of minor variations in strategy on the expected number of operations. Then we will show that if a small variation improves the strategy, then a larger change could mean even more improvement. The two sequences which we will compare first are a = coDG'D' and a* = coG'D'. The only difrerence between u and a* is the position of the block G'. The following lemma gives a method for comparing these two strings.

LEMMA 3. E(aDG k D N) < E(coGk D N) if and only if Vk(co) < 1, where

N1
P 1 GO)/P0 (C0) + 1: Pd ((O)/P0 (0)
d=l

Vk x [(N  d + 1)k _ (N  d)k lk'N k if po(o,)) > 0
10 if PO (C0) = 0.

Proof. Let u and u* be defined as above. We would like to know when E(u)  E(u*) <, 0. The following values for Q, (u) and Q,, (o*) can be easily verified.

Q. (a) =

Q.(u*)
P, (C0)

Pm.lcolk(o))

if m < 1(01 if jcol + 1 < m
,< lcol + k + 1

x (N m + lcul + k + 1)k N k if m > io)l + k + 'L

1

A Randomized Data Strudure for Ordered Sets 423
QJU) if m <, 1(01
N(O) if 1C01 + 1 Q. (U*) = < lcol + k

Pm  leol  k (C0)

x (N  m + icol + k)k N k if m > lcoi + k.

Thus E(u)  E(u*) = E' 0 [Q,, (a)  Qm (a*)] <, 0 is equivalent to
M=

N1
(k + I)PI kPO (co) + 1 Pd (co) [(N  d + 1)k
d=l

(N d )k IN  k p 1 (C0) < 0.

Rearranging terms slightly gives

N1
P 1 (C0) + E Pd (C0) RN  d + 1)k (AT  d)k IN k k` < Po(co).
d=l

Combining Lemmas 2 and 3 enables us to extend a minor variation of the string into a more radical change. Specifically, if the last block of G's is more efficient when it is moved to the right one place, then it is best to remove the block of Guesses altogether.

LEmmA4. For all co, ifE(coDGkDN) <, E(co&D'), then E(CODN) E(coDGk D N).

Proof. If EGoDiG k D N) < E(coDi 'G k D N) for some j >, 1, then Vk (coDi `) <, 1 by Lemma 3. By Lemma 2 and the definition of Vk(co), we know that Vk(wDi) <, Vk(coDi'), and thus that Vk(wDi)<,l. Thus, we can conclude by Lemma 3 that E(coDi + 1 G k D N) < E(coDI G k D N ). The proof of the lemma is completed by applying this process inductively.

it is now a simple matter to prove our main result.

THEOREm 2. For every starting sequence co, there exists an integer r >, 0 such that EGoU D') <, E(cou) for every completed sequence
COU.

Proof. Let a denote the shortest (in length) sequence for which
wa is complete and E(cou) = minE(coy). If Gr is of the form G'D N

424 J. L. BENTLEY et al.

then we are done. Otherwise u = co*DG'Di, where G' is the last
block of Guesses and j > 0. By the optimality of a, we know
that E(coco*DG k Di) < E(c~*G k Di `). Thus by Lemma 4, we
know that EGoco*Di`) < E(cow*DG'D") which contradicts the
minimality of u. n

Thus the best way to finish any initial sequence is by a block of Guesses followed by Decrements. By letting co be the empty string, we find that the optimal strategy for the game is G k pl . Recalling Theorem 1, we find that the optimal value is near ,IN  1 11(24,,1N) + 0 (1 IN) and thus S (N) = 2,,1N c where c is a constant that tends to 1/2 as N gets large. Hence the expected number of steps for Member Search is at most 2,/V,As a consequence, it is not difficult to show that this is within one or two steps of optimal whenever we are searching for an item that is bigger than the median. Searches for items less than the maximum are discussed more thoroughly in Section 4.

3.4. Algorithms for the Other Operations

Thus far we have considered only the problem of searching the linked list to determine if it contains a given element. It is easy to perform many other set operations on this structure. The following list summarizes those operations, and describes their costs in terms of the number of Value elements accessed.

• Member: The previous sections studied the problem of searching to determine whether a given element is a member of the set represented by the linked list. Cost: 2,fN
• Insert: A new element can be inserted in the list by using Member Search. Cost: 2,1N_.
• Delete: The first step in deleting an element is to find that element by a search algorithm, and then modify the Link field of its predecessor to point to its successor. This takes 2,,,1N references to the Value array. The next step must patch the "hole" created in the dense array by moving the last element of the array to the vacant position. Searching for the last element requires 2,,,1N references. Cost: 4,,1N
• Predecessor: The element immediately preceding a given element can be found by a simple modification to the Member Search algorithm. Cost: 2.j.

1

A Randomized Data Structure for Ordered Sets 425

• Succeessor: The element immediately succeeding a given element can be found by a simple modification to the Member Search algorithm. Cost: 2.,,lY
• Minimum: The minimum element in the set is pointed to by Link[O]. Cost: 1.
• Maximum: The maximum element in the set can be found by searching for infinity. Cost: 1JY

Each of the above operations is straightforward to implement given the Member Search algorithm and basic techniques for dealing with data structures for searching (described, for example, by Knuth [K]). Furthermore, the simplicity of the algorithm implies that the constant factors in the running time of the program will be relatively small. The only deviation that a programmer should make from the Member Search algorithm deals with therandom number generation: since some random number generators are very slow, it might be preferable to use some other approach to sample the k elements.

4. EXTENSIONS AND REMARKS

At first glance, it might appear that our proof technique depends on the fact that Guesses do not yield zero. This is not the case. When a Guess returns an integer uniform on [0 ... N], the optimal strategy differs from the above strategy by only 0(11.,,IN) Guesses.
Our analysis of the GD game was for the worst case task of searching for the final element in the list. We will now consider searches for a random element. The value i can be interpreted as the number of links between our present position in the list and the position of the element we are searching for. When searching for the last element we start at i = N. Suppose a Member search seeks an element whose position is unknown and randomly distributed. We should then start at an unknown, random i, or equivalently start at i = N and do one Guess to randomize i before counting operations. By Theorem 2 the optimum strategy is then G"D' as opposed to G'D' when searching for the last element.
Sometimes it is useful to find the j th element given j. When j is small it is easy to follow links and when j = N we can use the GD strategy. Between these two extremes, the strategies used thus far are not necessarily valid since the value of the jth element is

1

426

J. L. BENTLEY et al.

Figure 2. Graph of c(j) and S(N, j)l.,,IN where S(N, j) is the optimal expected number of operations when searching for the jth element.

S (N. j)1J N

F  1 9 1 5
1 2 2 3 4 i/IN

unknown. But if we know the value as well as the position of the element for which we are searching, then we can apply the above techniques to find an optimal GD search time. Searching for an element at position j means starting the GD game at i = j. This is equivalent to starting at i = N and doing N j Decrements. By Theorem 2 the best way to continue from this point is G'D' for some k. Thus it is only necessary to compute the optimal number of Guesses, k = r(j). By modifying Theorem 1 it can be shown that r(j) = c(j),/WN where

0

CU) =

if j < jN

(j {e(i  1)c(j),FN

X [1 C(j)21 _ 1} _ 0(11_'VIN) if j2N < j < 0 (_",1IN)

1  0(11~,IN)

if j >> n(jN).

It is easy to determine c(j) numerically and a graph of the function (e.g., see Figure 2) shows that (once nonzero) c(j) approaches 1 exponentially fast.
The GD game can be modified in other ways. One particularly interesting modification allows the player to use the information about when a random sample is successful (i.e., closer to the target). For example, suppose the value i is contained in a black box that is connected to a light that turns on every time i is decreased. At first glance, it appears as though such information could be quite useful in planning when to stop Guessing and start Decrementing. For instance, if the light flashed on for a series of early Guesses, then the player might be led to believe that the early Guesses were very good and thus that i had. become very small. Hence, the player might

1

A Randomized Data Sfructure for Ordered Sefs 427

think it wise to start Decrementing early. This is not the case, however, since if all the Guesses are required to be distinct, then it can be shown that the light does not add any useful information at all. It is worth remarking that the likelihood of two Guesses being identical is small and thus the constraint that all the Guesses be different has a negligible effect on the final result.
Such a counterintuitive result requires some justification. First notice that when all the Guesses are distinct, the sequence of Guesses is just a permutation of a subset R of { 1, . . . , N}. Every sequence of Guesses produces a unique light sequence, P, but one light sequence can be produced by many different Guess sequences. In particular:

LEMMA 5. For every light sequence P of length k there exists an integer m, such that for any set of Guesses R g {1, . . . , N} of size k, there are exactly m permutations of R that have light sequence

Proof. Let K k} and set m to be the number of permu
tations of K that produces the light sequence P. Now consider any
other subset R of length k. There is a 1  1 orderpreserving mapping
between K and R, so there is also a 1  1 mapping between the
permutations of the two sets that preserves light sequences. This
means that there are also m permutations of R that fit fl.

We can now prove

THEOREM 3. When all the Guesses in the GD game are distinct, then the light sequence adds no extra information about the value of i after a sequence of Guesses.

Proof. It is sufficient to show that the probability Pr [i j 1 fil that i = j after k guesses given a light sequence P, is the same as the probability Pr [i = j 1 that i = j after k guesses (with no knowledge of the light sequence). From the definitions and Lemma 5,
Pr [i j J fl] = (4 of sequences of Guesses that fit P for which i = j)l
( # of sequences of Guesses that fit P)
= (# of R for which i = j)ml(# of R)m

= (# of R for which i = j)MI(# of R)k!
= (# of sequences of Guesses for which i = j)/
(# of sequences of Guesses)
= PrU = il.

428 J. L. BENTLEY et al.

The GD game might also be played with two diflerent operations
0 , and 02. It would be interesting to know what properties of
0, and02give an optimal sequenoe of the form O'o', for some
1 2
k and m. In addition to operators that act directly on i, we might
also consider comparison operations that compare i to a given
input n, and answer the question, "i <, n?" If the compare operation
i < J2N is added to GD, then the optimal strategy is O(M1')
Guesses followed by a Compare, repeated until i < _,12N, ending
with Decrements. This strategy uses.,,f2N + O(N'I') expected steps.

ACKNOWLEDGMENTS

We would like to thank Gary Miller, Ron Rivest, Jim Saxe, and Mike