## Please see PDF version

Volume 5, Number 5 OPERATIONS RESEARCH LETTERS November 1986

LOGNORMAL LAW FOR A RENORMALIZATION CHAIN ARISING IN SEARCH
THEORY AND THE MODELLING OF DESCENT ALGORITHMS

J. Michael STEELE

Prograrn in Engineering Statistics, Princeton Universitv, Princeton, NJ 08544, USA

Received May 1986
Revised September 1986

A limit theorem is established for the asymptotic state of a Markov chain arising from an iterative renormalization. The limit theorem is illustrated in applications to the theory of random search and in probabilistie models for descent algorithms. Some special cases are also noted where exact distributional results can be obtained.

Markov chain * lognormal * descent algorithm

1. Introduction IrX 2 and f (x) = 2 irx, whence it is clear that 1 Z,,
and X,, have the same probability law.
To illustrate the lirnit theorem which is estab The random variables 1 Zn 1 can also serve to
lished here we first consider a simplified question aid our intuition about the general asymptotic
from the theory of random search. We suppose behavior of X, An easy calculation shows E 1 Z,, 1
that we seek to find an object which is located at = (2/3)", and this suggests that Xn is best studied
unknown position P. The tools available are as on a logarithmic scale. Still, it is perhaps surpris
sumed to allow us to get closer to P with each ing that for a wide class of functions f, the scaled
probe. The natural goal is to understand how far random variables Y. = logX,, are sufficiently trac
away from the target we will be with the nth table to allow a central limit theorem.
probe. Models with the renormalization form given
To specify a model for this process we suppose here are quite natural. The random variable X,,,
that f : [0, 11 > R is a nonnegative integrable is just a weighted random choice from that mass
function and F(x) is the integral of f from 0 to x. of f which is left over after the selection of X,
If Xn = x is the distance from target of the nth The defining renormalization relation (1.1) is also
probe, then the distance from the target of the closely connected with the basic models of choice
n + Ist probe is assumed to satisfy theory (see, e.g., Luce [31 and Steele [41).
P(X,,+1 E dy 1 Xn = x) =f (y)IF(x), Probably the most intriguing application of re
normalization chains is io the theory of descent
0 To be still more concrete, we can consider the developed here in the most detail.
situation where Z1 is the radius of a point chosen First, we will establish the main results.
randomly from the unit disc, Z2 is a point chosen
randomly from the disc of radius 1 Z1 1, and
successively Z,, is chosen at random from the disc 2. Approximate recursion for the characteristic
of radius 1 Z,,, 1. To put this in the notation of function
the more general search model, we can let F(x) =
Just as the asymptotics of the mean for the
Research was supported in part by National Science Founda simple example on the disc illustrates that X~
tion Grant no. DMS8414069. should be studied in the log scale, the relationship

01676377/86/\$3.50 0 1986, Elsevier Science Publishers B.V. (NorthHolland) 237

Volume 5, Number 5 OPERATIONS RESEARCH LETTERS November 1986

f (x) = x (or, more generally, f (x) = x') provides bine eqs. (2.1) and (2.2) to obtain
a leading case for the study of a characteristic
function associated with X, Since our main result
contains what m~glit otherwise appear to be a =f e"'ogxg,,(x)h(x)dx, (2.3a)
technical hypothesis, we will first derive some 0
formulas which, together with the leading case where
f (x) = x', will motivate the final result. X itlog(ylx) fly) y~(  1)
We will let (p~(t) be the characteristic function h(x)  e a + ~dy.
of log X. and we will aim toward a recursion fo F(X) x.+1
relation for the ( 0~ }. Letting g, denote the den (2.3b)

sity of X~, we can calculate We can now use (2.3) to get an effective mea
fl e itlogyp(x n E= dy) sure of the extent to which f(x) differes from x".
0 We have
1 e itl,)g,~f 1 g~I(X) dx f (y) dy 1 e"'ogxg,(x)h (x)dx
if
0
:~ f'g~(x) 1 h (x) I dx
fo e't log x&  1 (X) 0
X~ xeit'09(Ylx)f(y) < f'g~(x)h*(x) dx = r, (2.4a)
fo F(x) dy~ dx. (2.1) where 0 we define
This expression would be a recursion relation if  (xl f (Y) (a + 1)yC,
the bracketed integral were independent of x This h*(x) = JO 1 F(x) X a+l dy. (2.4b)
will be seen to be approximately the situation. We
first explore the inside integral in the leading case The quantity h* captures just what we need to
f(x) = x', a >  1, know about the approximability of f(x) by x'.
fX "log(ylx)f(y)IF(x)} dy We now use the defining recursion for & to
te calculate a recursion for r,, as follows:
= (a + 1) fo Xe itlog(Y/X)Yaxa1 dy r. = fo 1 g,,(x)h*(x)dx
=(1+a) le"'oluuadu =fl{ 1 ni(z) AX) dz C(x)dx
X
f' 0 f 9 F(z)
0
+ it C. (2.2) =f'gnl(z){ 1fzf(x)h*(x)dx}dz.
1 + a
0 F(z) . o
In the motivating case of 1 Z,, 1, we saw a (2.5)
so, the recursion (2.1) and the identity (2.2) give us So, if there is a 0 < X < 1 such that
that the characteristic function of log 1 Zn is ex 1 z
actly equal to (1 + it12) ', i.e., log 1 Z,, 1 is equal jF(
z) f f (x)h*(x)dx:::~ XC(z) (2.6)
to the sum of n exponential random variables 0
with mean 1. Naturally, this result can be ob for all z E= [0, 11, then
2
tained more directly, but the explicit example rn:5Xr,,, and r,,provides a conceptual guide and analytic cheek on
our main calculation. As a final note on the exam The recursion relations (2.3a) and (2.7) can be
ple, we see that log E Z,, 1 = n log(2/3) is con used to approximate 0,, in terms of powers of 4~
sistent with E log 1 Zn 1 =  n/2 since Jensen's in by noting
equality states E log 1 Z,, :~ log E 1 Z,, 1 and indeed X~lro,
0.5 < log(2/3) =  0.405. ~2(t)Sbn_2(t) 1 2r01
We now let %p (t) = (1 + itl(l + a))' and corn < rn  2

238

Volume 5, Number 5 OPERATIONS RESEARCH LETTERS November 1986
and generally, Here one should note that there is nothing
special about the interval [0, 11. It was chosen just
to keep notation clean. Also, the continuity of h*
:c~ r, X"  Mro guarantees that if (2.10) holds, then the nominally
Summing the expressions above for m between 1 stronger uniform inequality (2.6) must hold as
and n  k, we have well. The effective use of the preceding result will
often require a proper choice of scale in order to
(t)  ~n ( t ) Ok ( t t r.Xk (1 _X) be able to establish (2.10), and an instance of such
(2.8) change of scale is given in the course of the
examples detailed in the next section.
Here one should note that the characteristic
function 4i(t) does not vanish for any real t,
~'(0) =  i (1 + a)', and ~"(0) ~  2Q + a) 2. If 3. Random descent algorithms
W is an exponential random variable with mean 1A
equal to (1 + a)  1 and variance U2 equal to (1 + If 0 is a convex function on R d' then a se
a)2 then ~ is the characteristic function of  W. quence of vectors x. E= R d' 1 :5 n < oo, is said to
We see therefore that ~ (t(l + a))e" is the char satisfy the descent condition with respect to 0
acteristic function of a random variable with mean provided O(X~,1)<(P(Xn). Many important al
zero and variance 1. By the central limit theorem, gorithms are known to produce sequences which
we then get that ~"(t(l + a)ln'l')e"~n converges satisfy a descent condition for an appropriate
to e"/' for all t, so (2.8) implies for all fixed k choice of 0, and the descent property is one of the
that basic tools for analyzing numerical algorithms (cf.
~ On + a) 12/2 Luenberger [21). The EM algorithm is an example
limsup ~ e'tvfw  e of such a descent algorithm currently of great
noo Cn interest in statistics (see, e.g., Dempster, Rubin
:~ ro?~(1  X),. and Laird [11 and Wu [5]). In addition to calling
Since k is arbitrary, the required limit theorem is attention to the presence of the descent condition,
established. We can now summarize the main our reason for singling out the EM algorithm is
that it is known to exhibit linear convergence, as
result. opposed to the quadratic convergence which is
typical of Newton method based algorithms. The'
Theorem. If f : [0, 11  R is a nonnegative function lognormal law which was established in the pre
and F(x) = foxf(y)dy < m, 0::5 x:!~ 1, then for the ceding section will now be used to show that, in
Markov chain { Xn ) 'n 0 defined by the kernel general, random descent algorithms have a very
K(x, y) = P(X,,,l E dy 1 X,, = x) =f (y) 1 F(x), specific type of linear convergence.
0 < y < X, Here, for specificity, we will assume that 0 ~t 0
and that 0 is the minimum value attained by 0 in
we have the convergence in distribution of the convex set {x: 0(x):~, 1). We can then define

log X,, + nl(l + a) a random descent sequence by choosing a point
Cn /(1 + a) (2.9) U, at random from {x: 0(x):!s~ 1}, and then
successively choosing U
,, , 1 at random from
to the standard normal, provided only that the {X: 0(x):!~ OWM.
function If we let tt denote Lebesque measure on R
1xl f (y) (a + 1)y` then we will shortly show that the main theorem
h*(x)  JO jF(   1d of this note establishes that X,,= g{x:0(x):5
x) X,, Y
O(U
,,)} is asymptotically lognormal for a large
satisfies the integral inequality class of convex functions, 0.
We first examine the special case of quadratic
f Z f ( x ) h * (x ) d x < h z (2.10) s5. In this instance the expression log Xn can be
F(z) 0 shown to have exactly a gamma distribution.
for all 0 < z ::~, 1. If Q is a syrnmetric positive definite matrix, the

239

1

Volume 5, Number 5 OPERATIONS RESEARCH LETTERS November 1986

quadratic form x""Qx defines a convex function 0 hold for all z in the restricted range (0, 81.
which can also be written as ~(x) = xTMTMx The fact that h*(x) = 0(x') and the identity
where M is a nonsingular linear mapping of Rd (3.2) let us cheek easily that (3.4) holds for a
onto R'. Setting F(t) = p { x : (p (x) n~ t }, we see sufficiently small 8 > 0. By the equivalences just
F(t) = gM'(x: xTx:s~ I} = 0?d td (det M)' established, f, and h* satisfy eq. (2.10).
If fa is associated with the Markov chain Y,, by
~ W did(det Q)  1/2, (3.1) the basic relation (1.1), then Y,, has the same
where wd is the volume of the unit sphere in R". distribution transition kernel as 81X, for X,, :~ 8.
By the characteristic function calculation given by Since n is asymptotically lognormal and since
eq. (2.2), we see that  log X, is the sum of n P(X, > 8) tends to zero, we see X,, is also asymp
independent exponentials with mean 11d. totically lognormal.
We will now pursue the general case and build To summarize, we have proved that if g and
on the fact that any regular convex function is satisfy (3.2), then the random variables defined
locally well approximated as a quadratic. Here, by
the notion of approximation needs to be devel Z, = log g {X: (P(X)
oped with an eye toward the condition (2.10); but, have approximate mean nId and asymptotic vari
2
in sympathy with the leading case (3.1), we will ance nId . Moreover, the standardized variables
first consider those convex 0 which as t  0 satisfy (Z~,  nId)l ~nld2 are asymptotically normal.
the condition
f (t) = d X: S5(X) :~ t Ctd 1 + 0(td1
it 4. Concluding remark
(3.2)
where c and E > 0 are constants. In applications We have exhibited a lognormal law for the
where 0 is more general than the quadratic lead asymptotic state of a class of Markov chains which
ing to (3.1), one can still often obtain (3.2), even have a transition kernel defined by a natural re
with E = 1. The key issue is that of relating condi normalization. The Markov chain is motivated by
tion (3.2) with the more technical hypothesis (2.10) a simple search model and it also serves to model
the behavior of random descent algorithms. These
of the main theorem. 0(td+e), SO random descent models exhibit linear convergence
From (3.2) we have F(t) = ctdId + of a very precise type which is made explicit by
setting a = d  1, we note that for 0 < y < x the lognormal law. It would be of interest to
f (y) (a+') Y"~ = O(Y a+e /X a+l (3.3) provide a random descent model which could
F(x) Xa+1 exhibit the quadratic convergence rates typical of
Newton methods.
and for h * (x) defined as in (2.4b) we have h * (x)
O(X').
If we let fa (x) = f (6x), then the corresponding References
expressions for F. and h* satisfy F8(x)
foxfa (u)d u = 8 'F(8x) and [1] A.P. Dempster, D.B. Rubin and N.M. Laird, "Maximum
likelihood from incomplete data via the EM algorithm", J.
hf xl fs(y) (a + 1)yO, dy Royal Statistical Society B 39, 122 (1977).
,(x
*(x) x)
0 Xa+1 [21 D.G. Luenberger, Linear and Nonlinear Programming, 2nd
ed., AddisonWesley, Reading, MA, 1984.
SA f (U) (a + 1) ua [3] R. D. Luce, Individual Choice Behavior: A Theoretical A nal
F(Sx) (8x)0+1 du h*(8x). ysis, Wiley, New York, 1959.
[41 LM. Steele, "Limit properties of Luce's choice theory", J.
Math. Psych. 11 (2), 124131 (1974).
The basic convergence hypothesis (2.10) ap [51 C.F.J. Wu, "On the convergence properties of the EM
plied to f& is therefore equivalent to the condition algorithm", Annals of Statistics 11, 95103 (1983).
that the integral inequality
f Z f(u)h*(u)du 0
240