Volume 5, Number 5 OPERATIONS RESEARCH LETTERS November 1986

LOGNORMAL LAW FOR A RENORMALIZATION CHAIN ARISING IN SEARCH

THEORY AND THE MODELLING OF DESCENT ALGORITHMS

J. Michael STEELE

Program in Engineering Statistics, Princeton University, Princeton, NJ 08544, USA

Received May 1986

Revised September 1986

A limit theorem is established for the asymptotic state of a Markov chain arising from an iterative renormalization. The limit theorem is illustrated in applications to the theory of random search and in probabilistic models for descent algorithms. Some special cases are also noted where exact distributional results can be obtained.

Markov chain * lognormal * descent algorithm

1. Introduction

To illustrate the limit theorem which is established here we first consider a simplified question from the theory of random search. We suppose that we seek to find an object which is located at an unknown position P. The tools available are assumed to allow us to get closer to P with each probe. The natural goal is to understand how far away from the target we will be with the nth probe.

To specify a model for this process we suppose that f : [0,1] → R is a nonnegative integrable function and F(x) is the integral of f from 0 to x. If X_n = x is the distance from target of the nth probe, then the distance from the target of the (n+1)st probe is assumed to satisfy

    P(X_{n+1} ∈ dy | X_n = x) = f(y)/F(x),   0 < y < x.                        (1.1)
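In concrete terms, the transition law (1.1) is easy to simulate. For f(y) = y^a with a > -1 the conditional distribution function is (y/x)^{a+1} on (0, x), so inverse-transform sampling gives X_{n+1} = x U^{1/(a+1)} with U uniform on (0,1). The following minimal sketch (the sample sizes and seed are arbitrary illustrations) checks the case a = 1, the disc example discussed in this section, for which E X_n = (2/3)^n.

```python
import random

def step(x, a):
    # One transition of the kernel (1.1) with f(y) = y^a on (0, x):
    # P(X_{n+1} <= y | X_n = x) = F(y)/F(x) = (y/x)^(a+1), so
    # inverse-transform sampling gives X_{n+1} = x * U^(1/(a+1)).
    return x * random.random() ** (1.0 / (a + 1.0))

def run_chain(n, a, x0=1.0):
    # Run n steps of the renormalization chain started at x0.
    x = x0
    for _ in range(n):
        x = step(x, a)
    return x

# Disc example: f(x) = 2*pi*x, i.e. a = 1, for which E X_n = (2/3)^n.
random.seed(1)
n, reps = 10, 200_000
mean = sum(run_chain(n, a=1.0) for _ in range(reps)) / reps
print(mean, (2.0 / 3.0) ** n)   # the two values should nearly agree
```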

One quickly gets a feeling for this model by considering the situation where Z_1 is a point chosen randomly from the unit disc, Z_2 is a point chosen randomly from the disc of radius |Z_1|, and successively Z_n is chosen at random from the disc of radius |Z_{n-1}|. To put this in the notation of the more general search model, we can let F(x) = πx^2 and f(x) = 2πx, whence it is clear that |Z_n| and X_n have the same probability law.

The random variables |Z_n| can also serve to aid our intuition about the general asymptotic behavior of X_n. An easy calculation shows E|Z_n| = (2/3)^n, and this suggests that X_n is best studied on a logarithmic scale. Still, it is perhaps surprising that for a wide class of functions f, the scaled random variables Y_n = log X_n are sufficiently tractable to allow a central limit theorem.

Models with the renormalization form given here are quite natural. The random variable X_{n+1} is just a weighted random choice from that mass of f which is left over after the selection of X_n. The defining renormalization relation (1.1) is also closely connected with the basic models of choice theory (see, e.g., Luce [3] and Steele [4]).

Probably the most intriguing application of renormalization chains is to the theory of descent algorithms, which is taken up in Section 3. First, we will establish the main results.

Research was supported in part by National Science Foundation Grant no. DMS-8414069.

0167-6377/86/$3.50 © 1986, Elsevier Science Publishers B.V. (North-Holland)

2. Approximate recursion for the characteristic function

Just as the asymptotics of the mean for the simple example on the disc illustrates that X_n should be studied in the log scale, the relationship


f(x) = x (or, more generally, f(x) = x^a) provides a leading case for the study of a characteristic function associated with X_n. Since our main result contains what might otherwise appear to be a technical hypothesis, we will first derive some formulas which, together with the leading case f(x) = x^a, will motivate the final result.

We will let φ_n(t) be the characteristic function of log X_n and we will aim toward a recursion relation for the {φ_n}. Letting g_n denote the density of X_n, we can calculate

    φ_n(t) = ∫_0^1 e^{it log y} P(X_n ∈ dy)
           = ∫_0^1 e^{it log y} { ∫_y^1 g_{n-1}(x) F(x)^{-1} dx } f(y) dy
           = ∫_0^1 e^{it log x} g_{n-1}(x) { ∫_0^x e^{it log(y/x)} f(y)/F(x) dy } dx.        (2.1)

This expression would be a recursion relation if the bracketed integral were independent of x. This will be seen to be approximately the situation. We first explore the inside integral in the leading case f(x) = x^a, a > -1:

    ∫_0^x e^{it log(y/x)} { f(y)/F(x) } dy = (a+1) ∫_0^x e^{it log(y/x)} y^a x^{-a-1} dy
                                           = (1+a) ∫_0^1 e^{it log u} u^a du
                                           = (1 + it/(1+a))^{-1}.                            (2.2)
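Identity (2.2) can be checked numerically: sampling a single kernel step from an arbitrary starting point x and averaging e^{it log(X_1/x)} should reproduce (1 + it/(1+a))^{-1}, independently of x. A minimal Monte Carlo sketch (the values of a, t, and x below are arbitrary illustrations):

```python
import random, cmath

# Monte Carlo check of identity (2.2): for f(x) = x^a the inner
# integral equals (1 + it/(1+a))^{-1}, whatever the starting point x.
# One kernel step is X_1 = x * U^(1/(a+1)) by inverse-transform sampling.
random.seed(2)
a, t, x = 1.0, 1.0, 0.37          # x is arbitrary; the answer ignores it
N = 200_000
acc = 0j
for _ in range(N):
    x1 = x * random.random() ** (1.0 / (a + 1.0))
    acc += cmath.exp(1j * t * cmath.log(x1 / x))
estimate = acc / N
exact = 1.0 / (1.0 + 1j * t / (1.0 + a))
print(abs(estimate - exact))      # should be small (Monte Carlo noise only)
```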

In the motivating case of |Z_n| we saw a = 1; so, the recursion (2.1) and the identity (2.2) give us that the characteristic function of log |Z_n| is exactly equal to (1 + it/2)^{-n}, i.e., -log |Z_n| is equal to the sum of n exponential random variables with mean 1/2. Naturally, this result can be obtained more directly, but the explicit example illustrates our main calculation. As a final note on the example, we see that log E|Z_n| = n log(2/3) is consistent with E log |Z_n| = -n/2, since Jensen's inequality states E log |Z_n| ≤ log E|Z_n| and indeed -0.5 < log(2/3) = -0.405.

We now let ψ(t) = (1 + it/(1+a))^{-1} and combine eqs. (2.1) and (2.2) to obtain

    φ_n(t) - ψ(t) φ_{n-1}(t) = ∫_0^1 e^{it log x} g_{n-1}(x) h(x) dx,                        (2.3a)

where

    h(x) = ∫_0^x e^{it log(y/x)} [ f(y)/F(x) - (a+1) y^a / x^{a+1} ] dy.                     (2.3b)

We can now use (2.3) to get an effective measure of the extent to which f(x) differs from x^a. We have

    |φ_n(t) - ψ(t) φ_{n-1}(t)| = | ∫_0^1 e^{it log x} g_{n-1}(x) h(x) dx |
                               ≤ ∫_0^1 g_{n-1}(x) |h(x)| dx
                               ≤ ∫_0^1 g_{n-1}(x) h*(x) dx = r_{n-1},                        (2.4a)

where we define

    h*(x) = ∫_0^x | f(y)/F(x) - (a+1) y^a / x^{a+1} | dy.                                    (2.4b)

The quantity h* captures just what we need to know about the approximability of f(x) by x^a. We now use the defining recursion for g_n to calculate a recursion for r_n as follows:

    r_n = ∫_0^1 g_n(x) h*(x) dx
        = ∫_0^1 { ∫_x^1 g_{n-1}(z) f(x)/F(z) dz } h*(x) dx
        = ∫_0^1 g_{n-1}(z) { F(z)^{-1} ∫_0^z f(x) h*(x) dx } dz.                             (2.5)

So, if there is a 0 < λ < 1 such that

    F(z)^{-1} ∫_0^z f(x) h*(x) dx ≤ λ h*(z)                                                  (2.6)

for all z ∈ [0,1], then

    r_n ≤ λ r_{n-1},  and hence  r_n ≤ λ^n r_0.                                              (2.7)

The recursion relations (2.3a) and (2.7) can be used to approximate φ_n in terms of powers of ψ by noting

    |φ_n(t) - ψ(t) φ_{n-1}(t)| ≤ λ^{n-1} r_0,
    |ψ(t) φ_{n-1}(t) - ψ^2(t) φ_{n-2}(t)| ≤ λ^{n-2} r_0,


and generally,

    |ψ^{m-1}(t) φ_{n-m+1}(t) - ψ^m(t) φ_{n-m}(t)| ≤ λ^{n-m} r_0.

Summing the expressions above for m between 1 and n-k, we have

    |φ_n(t) - ψ^{n-k}(t) φ_k(t)| ≤ r_0 λ^k (1-λ)^{-1}.                                       (2.8)

Here one should note that the characteristic function ψ(t) does not vanish for any real t, that ψ'(0) = -i(1+a)^{-1}, and that ψ''(0) = -2(1+a)^{-2}. If W is an exponential random variable with mean μ equal to (1+a)^{-1} and variance σ^2 equal to (1+a)^{-2}, then the complex conjugate ψ̄ is the characteristic function of W. We see therefore that ψ̄(t(1+a)) e^{-it} is the characteristic function of a random variable with mean zero and variance 1. By the central limit theorem, we then get that ψ̄^n(t(1+a)/n^{1/2}) e^{-it√n} converges to e^{-t^2/2} for all t, so (2.8) implies for all fixed k that

    limsup_{n→∞} | φ_n(t(1+a)/√n) e^{it√n} - e^{-t^2/2} | ≤ r_0 λ^k (1-λ)^{-1}.

Since k is arbitrary, the required limit theorem is established. We can now summarize the main result.

Theorem. If f : [0,1] → R is a nonnegative function and F(x) = ∫_0^x f(y) dy < ∞, 0 ≤ x ≤ 1, then for the Markov chain {X_n}_{n=0}^∞ defined by the kernel

    K(x, y) = P(X_{n+1} ∈ dy | X_n = x) = f(y)/F(x),   0 < y < x,

we have the convergence in distribution of

    ( log X_n + n/(1+a) ) / ( √n /(1+a) )                                                    (2.9)

to the standard normal, provided only that for some a > -1 the function

    h*(x) = ∫_0^x | f(y)/F(x) - (a+1) y^a / x^{a+1} | dy

satisfies the integral inequality

    F(z)^{-1} ∫_0^z f(x) h*(x) dx < h*(z)                                                    (2.10)

for all 0 < z ≤ 1.
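For f(x) = x^a itself the conclusion can be verified directly, since each kernel step then multiplies X_n by an independent factor U^{1/(a+1)}, so that -log X_n is a sum of n i.i.d. exponentials with mean 1/(1+a). A brief simulation of the standardized variable in (2.9) (the parameter values are arbitrary illustrations):

```python
import random, math

# Sketch of the limit (2.9) in the exactly solvable case f(x) = x^a:
# -log X_n is a sum of n i.i.d. exponentials with mean 1/(1+a), so the
# standardized variable (log X_n + n/(1+a)) / (sqrt(n)/(1+a)) should
# have mean near 0 and variance near 1 for moderate n.
random.seed(3)
a, n, reps = 2.0, 200, 20_000
samples = []
for _ in range(reps):
    logx = 0.0
    for _ in range(n):
        logx += math.log(random.random()) / (a + 1.0)   # one kernel step
    samples.append((logx + n / (1.0 + a)) / (math.sqrt(n) / (1.0 + a)))
m = sum(samples) / reps
v = sum((s - m) ** 2 for s in samples) / reps
print(round(m, 2), round(v, 2))   # near 0 and 1
```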

Here one should note that there is nothing special about the interval [0,1]. It was chosen just to keep notation clean. Also, the continuity of h* guarantees that if (2.10) holds, then the nominally stronger uniform inequality (2.6) must hold as well. The effective use of the preceding result will often require a proper choice of scale in order to be able to establish (2.10), and an instance of such a change of scale is given in the course of the examples detailed in the next section.

3. Random descent algorithms

If φ is a convex function on R^d, then a sequence of vectors x_n ∈ R^d, 1 ≤ n < ∞, is said to satisfy the descent condition with respect to φ provided φ(x_{n+1}) < φ(x_n). Many important algorithms are known to produce sequences which satisfy a descent condition for an appropriate choice of φ, and the descent property is one of the basic tools for analyzing numerical algorithms (cf. Luenberger [2]). The EM algorithm is an example of such a descent algorithm currently of great interest in statistics (see, e.g., Dempster, Rubin and Laird [1] and Wu [5]). In addition to calling attention to the presence of the descent condition, our reason for singling out the EM algorithm is that it is known to exhibit linear convergence, as opposed to the quadratic convergence which is typical of Newton method based algorithms. The lognormal law which was established in the preceding section will now be used to show that, in general, random descent algorithms have a very specific type of linear convergence.

Here, for specificity, we will assume that φ ≥ 0 and that 0 is the minimum value attained by φ on the convex set {x : φ(x) ≤ 1}. We can then define a random descent sequence by choosing a point U_1 at random from {x : φ(x) ≤ 1}, and then successively choosing U_{n+1} at random from {x : φ(x) ≤ φ(U_n)}.

If we let μ denote Lebesgue measure on R^d, then we will shortly show that the main theorem of this note establishes that X_n = μ{x : φ(x) ≤ φ(U_n)} is asymptotically lognormal for a large class of convex functions φ.


We first examine the special case of quadratic φ. In this instance the expression log X_n can be shown to have exactly a gamma distribution. If Q is a symmetric positive definite matrix, the quadratic form x^T Q x defines a convex function φ which can also be written as φ(x) = (x^T M^T M x)^{1/2}, where M is a nonsingular linear mapping of R^d onto R^d. Setting F(t) = μ{x : φ(x) ≤ t}, we see

    F(t) = μ M^{-1}{x : x^T x ≤ t^2} = ω_d t^d (det M)^{-1} = ω_d t^d (det Q)^{-1/2},        (3.1)

where ω_d is the volume of the unit sphere in R^d. By the characteristic function calculation given by eq. (2.2), we see that -log X_n is the sum of n independent exponentials with mean 1/d.
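The 1/d scaling is easy to see by direct simulation in the simplest instance Q = I, i.e. φ(x) = |x| (a hypothetical concrete choice made only for illustration): the norm of a point chosen uniformly from a ball of radius r in R^d is distributed as r V^{1/d} with V uniform on (0,1), so each descent step adds an increment of mean -1/d to log φ(U_n); d = 2 recovers the disc example of the introduction. A minimal sketch:

```python
import random, math

# Random descent for phi(x) = |x| in R^d: U_{n+1} is uniform in the
# ball of radius phi(U_n), and the norm of a uniform point in a ball of
# radius r is r * V^(1/d) with V uniform on (0,1).  Hence the sample
# mean of log phi(U_n) should be close to -n/d.
random.seed(4)
d, n, reps = 3, 50, 40_000
total = 0.0
for _ in range(reps):
    log_phi = 0.0                                  # start from radius 1
    for _ in range(n):
        log_phi += math.log(random.random()) / d   # log of the V^(1/d) factor
    total += log_phi
print(total / reps, -n / d)   # sample mean of log phi(U_n) vs -n/d
```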

We will now pursue the general case and build on the fact that any regular convex function is locally well approximated by a quadratic. Here, the notion of approximation needs to be developed with an eye toward the condition (2.10); but, in sympathy with the leading case (3.1), we will first consider those convex φ which as t → 0 satisfy the condition

    f(t) = (d/dt) μ{x : φ(x) ≤ t} = c t^{d-1} + O(t^{d-1+ε}),                                (3.2)

where c and ε > 0 are constants. In applications where φ is more general than the quadratic leading to (3.1), one can still often obtain (3.2), even with ε = 1. The key issue is that of relating condition (3.2) to the more technical hypothesis (2.10) of the main theorem.

From (3.2) we have F(t) = c t^d / d + O(t^{d+ε}), so setting a = d-1, we note that for 0 < y < x

    f(y)/F(x) - (a+1) y^a / x^{a+1} = O( y^{a+ε} / x^{a+1} ),                                (3.3)

and for h*(x) defined as in (2.4b) we have h*(x) = O(x^ε).

If we let f_δ(x) = f(δx), then the corresponding expressions for F_δ and h*_δ satisfy F_δ(x) = ∫_0^x f_δ(u) du = δ^{-1} F(δx) and

    h*_δ(x) = ∫_0^x | f_δ(y)/F_δ(x) - (a+1) y^a / x^{a+1} | dy
            = ∫_0^{δx} | f(u)/F(δx) - (a+1) u^a / (δx)^{a+1} | du = h*(δx).

The basic convergence hypothesis (2.10) applied to f_δ is therefore equivalent to the condition that the integral inequality

    F(z)^{-1} ∫_0^z f(u) h*(u) du < h*(z)                                                    (3.4)

hold for all z in the restricted range (0, δ]. The fact that h*(x) = O(x^ε) and the identity (3.2) let us check easily that (3.4) holds for a sufficiently small δ > 0. By the equivalences just established, f_δ and h*_δ satisfy eq. (2.10).

If f_δ is associated with the Markov chain Y_n by the basic relation (1.1), then Y_n has the same transition kernel as δ^{-1} X_n for X_n ≤ δ. Since Y_n is asymptotically lognormal and since P(X_n > δ) tends to zero, we see X_n is also asymptotically lognormal.

To summarize, we have proved that if μ and φ satisfy (3.2), then the random variables defined by

    Z_n = log μ{x : φ(x) ≤ φ(U_n)}

have approximate mean -n/d and asymptotic variance n/d^2. Moreover, the standardized variables (Z_n + n/d)/√(n/d^2) are asymptotically normal.

4. Concluding remark

We have exhibited a lognormal law for the asymptotic state of a class of Markov chains which have a transition kernel defined by a natural renormalization. The Markov chain is motivated by a simple search model and it also serves to model the behavior of random descent algorithms. These random descent models exhibit linear convergence of a very precise type which is made explicit by the lognormal law. It would be of interest to provide a random descent model which could exhibit the quadratic convergence rates typical of Newton methods.

References

[1] A.P. Dempster, D.B. Rubin and N.M. Laird, "Maximum likelihood from incomplete data via the EM algorithm", J. Royal Statistical Society B 39, 1-22 (1977).
[2] D.G. Luenberger, Linear and Nonlinear Programming, 2nd ed., Addison-Wesley, Reading, MA, 1984.
[3] R.D. Luce, Individual Choice Behavior: A Theoretical Analysis, Wiley, New York, 1959.
[4] J.M. Steele, "Limit properties of Luce's choice theory", J. Math. Psych. 11 (2), 124-131 (1974).
[5] C.F.J. Wu, "On the convergence properties of the EM algorithm", Annals of Statistics 11, 95-103 (1983).