The Annals o.f Probability
1978, Vol. 6, No. 1, 118127

BY J. MICHAEL STEELE

University of British Columbia
If Xi, i = 1, 2, .. are independent and identically distributed vectorvalued random variables with distribution F, and S is a class of subsets of R4, then necessary and sufficient conditions are given for the almost sure convergence of(I1n)D,,s = SUPAeS 10112) E IA(Xi)  F(A)l to zero. Thecriteria are defined by combinatorial entropies which are given as the time constants of certain subadditive processes. These time constants are estimated, and convergence results for (11n)D.8 obtained, for the classes of algebraic regions, convex sets, and lower layers. These results include the solution to a problem posed by W. Stute.

1. Introduction. One of the most useful tools of probability theory developed in recent years is the theory of subadditive stochastic processes introduced by Hammersley and Welsh [ 101 and perfected by Kingman [ 11, 12] and Hammersley [9]. The first objective of this paper is to provide three new examples of 'such processes. The underlying motivation for studying these specific processes is that they provide an effective and unified approach to convergence theorems for the empirical discrepancy function.
To set the problem precisely, we let X, i = 1, 2, be a stationary sequence of random variables which are defined on a probability space (Q, P, 90) and which take values in Rd, d > 1. The common distribution function of the X, will be denoted by F; thus for any measurable A c Rd we have F(A) = P(X, c A). For most of the results obtained here, independence of the X, will be assumed, but, as indicated in the following, some results only require ergodicity of the X, process.
Now let S be any class of measurable subsets of M. For any A e S we denote by 1,,(x) the function on Rd which is 1 if x e A and 0 otherwise. The fundamental object of concern here is the generalized discrepancy function defined by

(1. 1) D.s = sup,.,, IZ?=, 1,,(X)  nF(A)l
The most classical choice of S is
S = {( , yJ x ( o., y,] x ... x ( co, yd] : (yl, y, yd) c Rdl

and in this case (11n)D.8 is the very well studied KolmogorovSmirnov statistic (see, for example, Durbin [4]). Beyond these classical regions there are, of course, numerous choices of S of interest. In particular, for S the class of convex Borel sets the first work is due to R. Ranga Rao [141 and there are refinements by W. Phillip [13], P. Gaenssler [81, and W. Stute [17, 18].

Received October 20, 1976; revised April 18, 1977.
AMS 1970 subject classifications. Primary 60F15; Secondary 6OG99.
Key words and phrases. Empirical distribution, subadditive processes, entropy, convex sets, lower layers, algebraic regions.
118

EMPIRICAL DISCREPANCIES AND SUBADDITIVE PROCESSES 119

The study of general classes S was also begun in R. Ranga Rao [ 14], and, this time, followed by Topsoe [19]. The approach which is taken here is along the completely different lines introduced by Vapnik and Chervonenkis [20] and [21
Their analysis was based on the combinatorial, rather than the topological, properties of S. The present work is concerned with perfecting and applying the method of Vapnik and Chervonenkis by means of subadditive ergodic theory.
Since in most cases of interest S is uncountable, the function D,,s is not a priori measurable. Still, for all classes which we consider, D.8 is measurable and for all results quoted here it will be assumed that D..' is measurable.
The main result of this paper is Theorem 4.2 which gives two necessary and sufficient conditions for the a.s. convergence of (11n)D.8 to zero. This result is preceded by two brief sections which introduce the quantities necessary to define the convergence criteria. The fifth section applies the main result to the class of algebraic regions of R,'. The sixth section specializes S to the class of convex sets, and thus obtains a solution to a problem posed by W. Stute [171. The final section similarly estimates the "time constant" of the subadditive process associated to the class of lower layers and consequently obtains as corollaries results due to, Blum [1] and DeHardt [5, 6].

2. The VapnikChervonenkis theorem. We will first observe that there is a direct relationship between the combinatorial structure 'Of S and the convergence in probability of (1 In)DJ. If T is any set we denote by 1 T1 the cardinality of T. Now if W is a finite subset of Rd then there are only finitely many distinct sets W n A for A c S, and we let As(W) denote the number of such sets. More succinctly, we have

(2.1) As(W) = I{W n A: A e S}1 .

To solidify the understanding of this ' definition we note that for S 00, y)
y c R} and W any k element subset of R one has As(W) = k + 1.
Now if Xp i = 1, 2,... are i.i.d. with distribution F the VapnikChervonenkis entropy V(F, S) is defined by

(2.2) V(F, S) = lim,._ 1 E log W(X, X, Xj
n

With this definition we now have the main theorem of Vapnik and Chervonenkis:
If X,, i = 1, 2, . .. are i.i.d. then (1~n)D,.8 converges to zero in probability if and only if V(F, S) = 0.
In addition to this weak law for (I In)DJ, Vapnik and Chervonenkis also provide a strong law under much more stringent conditions on A(X, X, . .., XJ, (see e.g., Theorem 3, [201). A principal corollary of the present work is that one actually has the strong law under precisely the same circumstance as the weak law. That is, we prove that (11n)D,s converges to zero with probability one if and only if V(F, S) = 0.

3. Combinatorial entropy. The Vapnik_Chervonenkis entropy is clearly a

120 J. MICHAEL STEELE

useful measure of the complexity of S, but it is not at all easy to compute. For this reason we introduce a new type of combinatorial entropy which is much more easily computed, but which is still effective in the study of D.B.
If W is a finite subset of Rd, we say that W' is shattered by S provided that every subset of W' can be written as W' n A for some A c S. Next we let ks(W) equal the cardinality of the largest subset W' c: W which is shattered by S. This set function can also be expressed in terms of As(W) as follows:

(3.1) C(W) = max fk: As(WI) = 21,, W' c W, 1WII = k} .

The combinatorial entropy K(F, S) is now defined by the limit

(3.2) K(F, S) = lim inf,, 1 Eks(Xl, X, X,) .
n

In the next section we will prove that (11n)D.8 converges a.s. to zero if and only if K(F, S) = 0. In later sections K(F, S) will be estimated for several specific classes, in particular for the class of convex subsets of Rd.

4. Three subadditive processes. After Kingman [11, 121 we call a family Y, (s < t) of doubly integerindexed random variables a subadditive process pro. vided three conditions hold:

S, Whenever s < t < u, Y.. :~, Ya, + Y,.;
S, The joint distributions of the process (Y.+1,t+l) are the same as those of (Y.A;
S, The expectations g, = E(Y(J exist and satisfy gt ~t At for some A ~> 0 and all t ~~t 1.

The most valuable consequence of these properties is brought out by Kingman's theorem which says that for any subadditive process there is a random variable e with finite mean such that
(4.1) e = litnt Y.tlt
with probability one and in mean.
Perhaps the easiest, but most crucial, observation of this section is that the generalized discrepancy, the VapnikChervonenkis entropy, and the combinatorial entropy are all essentially related to subadditive processes.

THEOREm 4. 1. If X, i = 1, 2, is a stationary sequence of random variables
then each of the following is a subadditive process:

(4.2) D8 = SUPA.S 1 E1.1+1 1 AR1)  (t  s)F(A)J
at
(4.3) log AS log W(Xa+', x#+2,
at ' X)

(4.4) k' C(X.,, Xa+2, Xt)
at

PROOF. The stationarity property S, follows directly from the stationarity of the X, in all three cases. Also, since each of the processes is positive and bounded we note that S, is valid. Only S, will require individual checking.

EMPIRICAL DISCREPANCIES AND SUBADDITIVE PROCESSES 121

The proof of S, for D.st is just the triangle inequality. TO prove S, for log A,It we first note that for any A e S one has

(4.5) A n {X.,, X.,., . . ., X,, xl+11 ... 1 XJ

= (A n Ra+V X1+21 1 XJ) U (A n R,+0 xt+29 XUD

But (4.5) shows that there are fewer sets of the form A n (X.+, X,, 1 X.}
than there are pairs of sets (A n (X.+,, X.+, . * * y Xt)~ A n 1Xt+l, Xt+25 X.D.
Since the number of such pairs is As As , (4.5) proves that
at tu

(4.6) As :~ AS AS
\$U  at tu

On taking logarithms one has S, for log As at'
Now to prove S, for ks suppose that W is the largest shattered subset of at
lx.+V xe+2y 3 xt+15 Y Xu}. Then W, = W n (X.+, X.+, XJ and W,
W n R,+V Xt+21 X.} are also shattered. Hence we have

(4.7) k' = 1WI = 1Wl + 1Wl 5 ks + ks
at at tu

and the proof of Theorem 2.1 is complete.
To state the next result in a symmetric fashion we introduce a constant D(F, S) defined as follows:

(4.8) D(F, S) = lim inf.__ 1 E SUPA. 8 i"=, 1 A (XJ  F(A) 1
n

THEOREM 4.2. If X, i = 1, 2, . . is a stationary ergodic sequence then the followin&, limits hold almost surely and in mean:

1 D8 = D(F, S)
(4.9) liffl.  0"
n

(4.10) liffl._ 1 log A08n = V(F, S)
n

(4.11) lim,, 1 kS = K(FyS).
On n

Moreover, if the Xt are i.i.d. and one of the constants D(F, S), V(F, S), or K(F, S) is zero then they are all zero.

PROOF. The existence of the limits on the left side of equations (4.9), (4.10), and (4.11) is an immediate consequence of Kingman's subadditive ergodic theorem and Theorem 2. 1. Further since these limits are unchanged by the shift (X)  (X,,) the ergodicity of the (Xi) process implies the limits are constants. These constants are then identified as above by noting that all the variables above are bounded by 1 and then applying the dominated convergence theorem.
To prove the second half of the theorem we first recall that if the X, are i.i.d. the main theorem of Vapnik and Chervonenkis says that nlDs converges to
On zero in probability if and only if V(F, S) == 0. Hence by the convergence in

122 J. MICHAEL STEELE

mean of nID08. we have that D(F, S) = 0 if and only if V(F, S) = 0. The theorem is completed once one proves that V(F, S) = 0 if and only if K(F, S) = 0 and that will be accomplished by the following lemma.

LEMMA 4.3. If we define h(x) = log xl(l for 0 < x < j and h(x) = 1
for J :!~ x < 1 we have the inequality

(4.12) K(F, S) :!!~ V(F, S) ;S h(K(F, S)).

Also we have V(F, S) = 0 if and only if K(F, S) = 0.

PROOF. Since 21's <= As the first inequality of (4.12) follows by taking loga
On
rithms and limits. To prove the second inequality we recall a result due to N. Sauer [151 (which is only a little sharper than Lemma 1, pqge 266, [201). According to Sauer's theorem there must exist a shattered k element subset of X, X, ..., X,, if A.s% >= E tl (%) + 1. Hence for any 0 < a < j we have that k.s. < [an] implies

(4.13) A On :S E ill%, (7) ;5 (a 1) G 1)

By Stirling's formula one then obtains
(4.14) V(F, S) :!s~ h(a)
for all a such that K(F, S) < a. Finally by the continuity of h(x) one obtains
(4.12) and the fact that V(F, S) = 0 if and only if K(F, S) = 0.
The proof of Lemma 4.2 completes the proof of Theorem 4.2 and, even though the result has been obtained effortlessly, we will note in succeeding sections that it contains numerous known results concerning Ds as well as new ones.
On
Simplest among the conclusions of Theorem 4.2 is the fact that (l In)Ds always
On
converges a.s., even when the limit is not zero. In the present context this is not as surprising as it would be if one approached the study of Ds without sub
On
additive ergodic theory. Otherwise, it would have been noted long ago.
To demonstrate the applications of the other conclusions of Theorem 4.2 we will consider specific classes in the following sections and show how K(F, S) can be computed.

5. An application to algebraic regions. The first special class which we consider is the class of polynomial regions. To describe these we first let P,, denote the set of all real polynomials in d variables of degree not greater than m. For example,
P2.3 aijk x`yj2,1, a,,, real, 0 i + j + k s; 2}
Now if g e P,, then g defines a subset Al. of M in a natural way as follows,

(5.1) A, = (v c R,' g(v) ~t 0} .

Finally, we call the class of sets P.,, {A,: g c P.,j the class of mth degree polynomial regions of R,'. One should note, for example, that Pm contains all the ellipsoids in R' as well as many other regions.

EMPIRICAL DISCREPANCIES AND SUBADDITIVE PROCESSES 123

An especially important class of algebraic regions are the halfspaces H = Pm.
For these regions the calculation of AH is essentially a classical result due to
0%
Schifli (see [3] and [16]), which we now note.

PROPOSITION 5. 1. If yl, y,, .  ., y. are elements of Rr and if H = P,,, then

(5.2) AH(y~, y', . . ., y.) ;S 2 Er=. (m') ,
and equality holds in (5. 2) provided the y, are in general position (i. e., any T or fewer
of the yi are linearly independent).

That an estimate for As with S follows from (5.2) was first observed
On
by T. Cover [31, whose brief proof we quote.

PROPOSITION 5.2. For S = P.,,, one has

(5.3) As t, 2 7, r_o where r = (r*+d) .
On d
PROOF. The idea is to associate to each element of S = P.,d an element of H = P,,. If y c Rd and y (y(I), y(2), . . ., y(d)) we define (D(y) c R" by

(5.4) 0(y) = (1, y(l), ., y(d), y(l)', . . ., y(d)2, y(i)y(j), y(d)") .
Now the proof is completed by noting that
AS(y~, y', y.) ;5 AH(O(y), 0(y,), o(y,)) .!s~ 2 E r_
k () k

The preceding estimates naturally yield the following consequence.

COROLLARY 5. 1. If X,, i = 1, 2, . .. is a sequence of i. i.d. random variables with any distribution, and if S = P.,d then lim.(11n)D8 = 0 a.s.
On

Of course, to prove the above one just notes that A4 is dominated by a polynomial and thus liim.. (11n) log As = 0 a.s. One should note that the corollary
on
contains the GlivenkoCantelli theorem for half spaces due to Wolfowitz [22, 231.

6. Convex sets and Stutels problem. By applying our main result, Theorem 4.2, to the class of convex Borel sets, we can now answer a question posed by W. Stute [17, page 1681.
To state Stute's problem we first note that for any distribution F, there are unique finite measures F. and F, such that F = F, + F,, F, is atomic, and F, is nonatomic. Next if C is the class of convex Borel subsets of Rd and A c C we write aA for the boundary of A. It was proved by R. Ranga Rao [141 that

(6.1) sup,,, F,(aA) = 0

is a sufficient condition for D(F, C) = 0. To provide a comparable necessary condition, W. Stute [181 introduced e(A), the set of extreme points of A and proved that D(F, C) = 0 implies

(6.2) sup.., F.(e(A)) = 0

The problem considered of "considerable interest" by W. Stute ([ 17, page 1681) is to find conditions for D(F, C) = 0 which are stronger than (6.2). The following

124 J. MICHAEL STEELE*

theorem proves a new proof of R. Ranga Rao's sufficient condition, W. Stute's necessary condition, and an answer to W. Stute's problem.

THEOREm 6. 1. Let X, i = 1, 2, . . ., be a sequence of ergodic stationary random variables with values in Rd and distribution F. For C, the class of convex Borel sets, one has

(6.3) sup,.. F.(e(A)) ;5 K(F, C) :!~ supA ., F,(aA)

PROOF. First note that any finite set of distinct elements of e(A) is shattered by C. Now if DO is the union of all the atoms of F, then by the continuity of F on D the set (X,: X, e D, i = 1, 2, .. ., n} contains only distinct points with probability one. Hence we have for any A e C
(6.4) kc(X, X, . .., X.) k Ei, 1Dne(A)(x, i)
By the ergodic theorem (6.4) yields

lim. nlk,r, ~~: F(e(A) n D) = F,(~(A)) and taking the supremum yields the first inequality in (6.3).
To prove the second inequality we require a simple truncation. Letting B. denote the ball about 0 of radius m we define k(m) by

(6.5) k(m) = lim.n1k1QX, X, Xj n B. n L;) .

(6.6) K(F, C) :5 k(m) + F(B.0) + lim.nlkc(IX,, X, X.} n Do)
If A, A, denote the atoms of F and G, = U,,k A, then. for any k 2~t
(6.7) nlkc({Xl, ..., Xn} n DO) ::~ n1k + n1 E?., 1,k(Xi).

The ergodic theorem yields for all fixed k,

lim._. nlkc(IX,, X, . . . , X.} n DO) :!s~ F(GA:) .
Since F(Gj > 0 as k oo, (6.6) simplifies to
(6.8) K(F, C) ::~ k(m) + F(B.0) .

To estimate k(m) we use the Blaschke selection theorem (see Eggleston [7, page 641) which allows us to choose for each r a finite subclass S,. of convex subsets of B. with the following property:
If E is a convex subset of B, there is an A c S, such that aE c (x: inf,,,, 1x  yl t~ llr} .

We note that the set {x: inf,.,,, 1x  yl < 6} is a "thickened boundary" of A
and we will denote it by T(A, 5). Now we have a basic observation:
(6.9) kc({Xl, X, ..., Xj n B,. n D) < max,,,,. E?=, iCne(B)(Xi) *
The proof of (6.9) consists in noting that for a set of points to be shattered they
must be extreme points of a convex set. Now by the choice of S, the inequality

1

i

i

1

i

i

EMPIRICAL DISCREPANCIES AND SUBADDITIVE PROCESSES 125

(6.9) implies that

(6.10) kc({X~, X,, ..., Xj n B. n D) ;5 max,,,,
1=1 'DnT(A,llr)(Xi) which by the ergodic theorem yields

(6.11) k(m) ;5 maXA . S, FO(T(A, 1 Ir))
TO take advantage of (6.11) we choose for each r an A, e S, such that

(6.12) k(m) ;5 F.(T(A,, llr)) for all r .
Again by Blaschke's theorem there is a convex At c: B. such that one has the following:
Given any 6 and any R there is an r = r(b, R) such that r ~t R and

T(A,., 1 1r) c T(A', 6)
Hence we have

(6.13) lim, F,(T(A,, 1 Ir)) :!~ F,(T(A1, 3)) ,
and by (6.12) and (6.6) this yields

(6.14) K(F, C) :!~ F(B.0) + F(J(X, 3)).

The countable additivity of F, gives lim,_(, Fj(X, 6)) = F,(aA') so (6.14) implies the second inequality in (6.3), since F(B.0) can be made as small as we like. This completes the proof of Theorem 6. 1.
Now we need to check that Theorem 6.1 actually provides a solution to Stute's problem. We already know from Theorem 4.2 that K(F, C) = 0 is necessary and sufficient for D(F, C) = 0. But by the inequality of Theorem 6.1 we also have K(F, C) = 0 is in a very strict sense a stronger necessary condition than

SUPA., Fje(aA)) = 0 for D(F, C)= 0

One further consequence of Theorem 6.1 is that in the case where all subflats have measure zero one can prove that

sup,,.,, F.(e(aA)) = sup.,,, FjaA)

so the inequalities in (6.3) become equalities. The Theorem 6.1 is then an instance of computing the "time constant" of a subadditive process, and the calculation of such constants was pointed out by Kingman [131 to be a basic problem in the theory of subadditive processes. 1

7. The class of lower layers. There is an interesting class of sets introduced by J. Blum [ 11 which is quite different from the class of convex sets but for which a twin result to Theorem 6.1 can be proved. A Borel set A c R d is called a lower layer provided for all (y,, y,, . . ., yd) c A we also have

(7.1) {(ylt, yJ, . . ., y,,'): yit ~5 y,, i = 1, 2, . . ., d} c: A .

To fortify ones understanding of (7.1) it is useful to check that in R' the set of points which lie below a decreasing function forms a lower layer.

126 J. MICHAEL STEELE

The class of lower layers in M will be denoted by L, and we need a notion of extreme point for A e L. By analogy to the convex case we say x is an extreme point of A e L provided A\1xl e L. There will be no confusion in denoting the set of extreme points of A by e(A). If F,, is again used to denote the nonatomic part of F, then the following result holds:

THEOREm 7.2. For X,, i = 1, 2, a stationary ergodic sequence with values
in Rd and distribution F,

(7.2) SUPA  L F,(e(A)) ~5 K(F, L) ;5 sup,,.,, Fo(aA)

The proof goes so nearly like that of Theorem 6.1 that it can be omitted safely. The crucial observation is that a set of points {xl, x,, . . ., x,.} c: R d is shattered if and only if {x,, x,, . . ., x,,} c e(A) for some A e L. Also one needs to know that there is a compactness, theorem due to Brunk, Ewing, and Utz [2] which plays the same role for L that Blaschke's theorem plays for C.
As a corollary to Theorem 7.1 we obtain a sufficient condition (due to Blum [11) and a necessary condition (due to DeHardt [5, 6]) for a GlivenkoCantelli theorem for the class of lower layers.

COROLLARY 7.2. For X,, i = 1, 2, ... Li.d. one has D(F, L) = 0 if sup,., F,(aA) = 0, and sup.,., F,(e(A)) = 0 if D(F, L) = 0.

REFERENCES
[11 BLUM, J. R. (1955). On convergence ofempirical distribution functions. Ann. Math. Stat13t. 26527529.
12] BRUNK, H. D., EwING, G. M. and UTZ, W. R. (1956). Some Holley theorems for monotone functions. Proc. Amer. Math. Soc. 7 776783.
[3] COVER, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications to pattern recognition. IEEE Trans. on Elec. Comp. EC14 326334.
[4] DURBIN, J. (1973). Distribution Theory for Tests Based on the Sample Distribution Function. 'SIAM, Bristol.
[5] DEHARDT, J. (1970). A necessary condition for GlivenkoCantelli convergence in E,.. Ann. Math. Statist. 4121772178.
[6] DEHARDT, J. (1971). Generalizations of the GlivenkoCantelli theorem. Ann. Math. Statist. 42 20502055.
[7] EGGLESTON, H. G. (1958). Convexity. Cambridge University Press, Cambridge.
[8] GAENSSLER, P. (1974). Around the GlivenkoCantelli theorem. Colloquia Mathernatics,
Societatis Jinor Bolyai 11 Limit Theorems of Probabiiity Theory (P. R6v6sz, ed.)
Keszthely (Hungary). 1
[9] HAMMERSLEY, J. M. (1974). Postulates for Subadditive Processes. Ann. Probability 2 652680.
[101 HAMMERSLEY, J. M. and WELSH, D. J. A. (1965). Firstpassage percolation, subadditive processes, stochastic networks. and generalized renewal theory. BernoulliBayesLaplace Anniversary Volume. SpringerVerlag, Berlin, 61110.
[111 KINGMAN,J.F.C.(1968). The ergodic theory of subadditive stochastic processes. J.Roy. Stalist. Soc. B30 499510.
[12] KINOMAN, J. F. C. (1973). Subadditive ergodic theory. Ann. Probability 1883909.
[131 PHILLIP, W. (1973). Empirical distribution function and uniform distribution mod 1. Diophantine Approximation and its Application. Academic Press, New York.

1

EMPIRICAL DISCREPANCIES AND SUBADDITIVE PROCESSES 127

[141 RAO, R. RANGA (1962). Relations between weak and uniform convergence of measures with applications. Ann. Math. Statist. 33 659680.
[151 SAUER, N. (1972). On the density of families of sets. J. Comb. Theory (A) 13 145147.
[16] SCHLXFLi, L. (1950). Gesammelte Mathematische Abhandlungen L Verlag Birkhailser, Bern, 209212.
[171 STUTE, W. (1975). A necessary condition for the convergence of the isotrope discrepancy. No. 16 RUB Preprint, RuhrUniversitht.
[18] STUTE, W. (1976). On a generalization of the GlivenkoCantelli theorem. Z Wahrscheinlichkeitstheorie und Verw. Gebiete 35 167175.
[191 TopsOE, F. (1970). On the GlivenkoCantelli theorem. Z Wahrscheinlichkeitstheorie und Verw. Gebiete 14 239250.
[20] VAPN1K, V. N. and CHERVONENKis, A. YA. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probability Appl. 16 264280.
[21] VAPN1K, V. N. and CHERVONENKis, A. YA. (1974). TeoriyaRaspoznavaniyaObrazov. Nauka, Moscow.
[221 WOLFOWITZ, J. (1954). Generalizations of the theorem of GlivenkoCantelli. Ann. Math. Statist. 25 131138.
[231 WOLFOWITZ, J. (1960). Convergence of the empiric distribution function on halfspaces. Contributions to Probability and Statistics. Stanford University Press.

DEPARTMENT OF MATHEMATICS
THE UNIVERSITY OF BRITISH COLUMBIA 2075 WESBROOK PLACE
VANCOUVER, B.C.