## Please see PDF version

The Canadian Journal of Statistics vol.6 (1978), No.2, 193-200.
La Revue Canadienne de Statistique

INVALIDITY OF AVERAGE SQUARED ERROR CRITERION
IN DENSITY ESTIMATION

by
J. Michael Steele*
Stanford University
Key words and phrases. Density estimation, mean integrated squared error,
average squared error. AMS 1970 subject classifications: Primary 62G20;
secondary 62F20.

ABSTRACT

The average squared error has been suggested earlier as an appropriate estimate of the integrated squared error, but an example is given which shows their ratio can tend to infinity. The results of a Monte Carlo study are also presented which suggest the average squared error can seriously underestimate the errors inherent in even the simplest density estimations.

1. INTRODUCTION

In almost all theoretical work on density estimation the quality of the estimator $\hat f_n(x)$ has been measured by the mean integrated squared error (MISE), which is given by

$$\mathrm{MISE}(f, \hat f_n) = E_f\left[\int_{-\infty}^{\infty} \bigl(f(x) - \hat f_n(x)\bigr)^2 w(x)\,dx\right]. \tag{1}$$

Here $w(x)$ is a fixed weight function, often taken to be 1, and $E_f$ is the expectation under the true density $f(x)$.
While this criterion is very convenient in mathematical analyses, the presence of the $n$-fold integral represented succinctly by $E_f$ presents considerable difficulty when the MISE is to be determined numerically. In order to circumvent this difficulty Wegman (1972) proposed the use of the average squared error (ASE), which is defined by

$$\mathrm{ASE}(f, \hat f_n) = \frac{1}{n} \sum_{i=1}^{n} \bigl(\hat f_n(X_i) - f(X_i)\bigr)^2, \tag{2}$$

where $X_1, \ldots, X_n$ are the same observations used in the construction of $\hat f_n(x)$.
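As a minimal sketch of definition (2), the following Python computes the ASE for a single sample. The estimator used here, a unit-variance normal density centred at the sample mean, is an illustrative assumption and is not taken from the paper; the function names are likewise hypothetical.

```python
import math
import random

def normal_pdf(x, mean=0.0):
    """Density of the normal distribution with the given mean and unit variance."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def ase(f, f_hat, sample):
    """Average squared error: mean of (f_hat(X_i) - f(X_i))^2 over the sample."""
    return sum((f_hat(x) - f(x)) ** 2 for x in sample) / len(sample)

random.seed(0)
n = 50
sample = [random.gauss(0.0, 1.0) for _ in range(n)]

# Illustrative estimator (an assumption): unit-variance normal centred at the sample mean.
x_bar = sum(sample) / n
f_hat = lambda x: normal_pdf(x, mean=x_bar)

ase_value = ase(normal_pdf, f_hat, sample)
print(ase_value)
```

Note that the same `sample` is used both to build `f_hat` and to evaluate it, which is exactly the feature of the ASE questioned in what follows.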

* Research supported in part by the National Research Council Canada while the author was at The University of British Columbia.

194 STEELE [Vol.6, No.2

This is certainly a more convenient measure of quality than the MISE, but this convenience is paid for in several ways. In the first place, how does one relate the ASE to the MISE? In the second, how much effect is there on the ASE due to the use of the same data in constructing $\hat f_n(x)$ as in testing $\hat f_n(x)$?
The original motivation for the ASE given by Wegman (1972, p.228) was that it should approximate the integrated squared error with a weight function $f(x)$. This interpretation was carried on by Fryer (1977), who refers to the ASE as "an experimental MISE". The first objective of this paper is to scrutinize this motivation further and to suggest the difficulties it poses.
The second objective is to report the results of simulations which were conducted to determine the effects of using the observations both for constructing and assessing $\hat f_n(x)$. Since such a procedure is analogous to testing a discriminator on the data used to construct it, one could expect the ASE to seriously underestimate the errors of estimation. Although the study reported here is modest in scope, the results clearly support this expectation. In the Conclusion, some problems are mentioned which would be of interest for the foundations of the theory of density estimation.

2. HOW CLOSE ARE THE ASE AND THE MISE?

The ASE can be written suggestively as

$$\int_{-\infty}^{\infty} \bigl(\hat f_n(x) - f(x)\bigr)^2\, dF_n(x), \tag{3}$$

where $F_n(x)$ is the empirical distribution function. As Wegman notes, when $n$ becomes large $dF_n(x)$ approximates $f(x)\,dx$, hence (3) approximates

$$\int_{-\infty}^{\infty} \bigl(\hat f_n(x) - f(x)\bigr)^2 f(x)\,dx. \tag{4}$$

This last integral is naturally the integrated squared error (ISE) with weight $f(x)$.
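To make the comparison between (3) and (4) concrete, the sketch below evaluates the weighted integral (4) with a trapezoidal rule and sets it beside the ASE for the same sample. The estimator (a unit-variance normal centred at the sample mean), the integration range [-8, 8], and the grid size are assumptions chosen only for illustration.

```python
import math
import random

def normal_pdf(x, mean=0.0):
    """Density of the normal distribution with the given mean and unit variance."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def weighted_ise(f, f_hat, lo=-8.0, hi=8.0, steps=4000):
    """Trapezoidal approximation of the integral of (f_hat - f)^2 * f over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0  # end points count half
        total += weight * (f_hat(x) - f(x)) ** 2 * f(x)
    return total * h

def ase(f, f_hat, sample):
    """Average squared error over the sample used to build f_hat."""
    return sum((f_hat(x) - f(x)) ** 2 for x in sample) / len(sample)

random.seed(1)
n = 50
sample = [random.gauss(0.0, 1.0) for _ in range(n)]
x_bar = sum(sample) / n
f_hat = lambda x: normal_pdf(x, mean=x_bar)  # illustrative estimator (assumption)

ise_value = weighted_ise(normal_pdf, f_hat)
ase_value = ase(normal_pdf, f_hat, sample)
print(ise_value, ase_value)
```

Both quantities are small for a reasonable estimator, which is why their ratio, rather than their difference, is the informative comparison.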
One intrinsic difficulty with this line of reasoning is that both (3) and (4) are approximately zero for large $n$. Hence, in using ASE as a stand-in for (4), which is in turn used as a stand-in for the MISE, the relevant comparison would be provided by the ratio of (4) to (3). The theorem of this section shows that even in the favorable case of estimation of smooth densities this ratio can be disappointingly large. This is one means of showing that the ASE is an inappropriate substitute for the MISE.
Before stating the theorem we recall that a density estimator $\hat f_n$ is consistent for a family of densities $\mathcal{F}$ provided that for any $f \in \mathcal{F}$ one has

$$\lim_{n \to \infty} \int_{-\infty}^{\infty} \bigl(\hat f_n(x) - f(x)\bigr)^2 f(x)\,dx = 0.$$

1978] AVERAGE SQUARED ERROR INVALIDITY 195

If the convergence above is almost sure, $\hat f_n$ is called strongly consistent, and if the convergence is in probability, $\hat f_n$ is simply said to be consistent.

THEOREM. There is a density estimator which is strongly consistent for the class of differentiable square integrable densities, but for which

$$\frac{\mathrm{ISE}(\hat f_n, \phi)}{\mathrm{ASE}(\hat f_n, \phi)} \to \infty \quad \text{a.s. as } n \to \infty, \tag{5}$$

where $\phi$ is the unit normal density.

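Proof. Let $\hat g_n$ be any strongly consistent density estimator. We will obtain $\hat f_n$ as a modification of $\hat g_n$. Now for any subset $S$ of $\mathbb{R}$, $I_S(x)$ will be the indicator of $S$. Letting $A_i = [X_i - 1/n^2,\; X_i + 1/n^2]$ we set

$$r_n(x) = (1 + n^{-2}) \sum_{i=1}^{n} (2\pi)^{-1/2} e^{-X_i^2/2}\, I_{A_i}(x).$$

Next choose a sequence $v_n$ of reals which tends to infinity and which satisfies

$$\int_{v_n}^{v_n+1} \phi(x)\,dx \ge 1/n. \tag{6}$$

We then let $B_n = [v_n, v_n + 1] \setminus \bigcup_{i=1}^{n} A_i$ and set

$$s_n(x) = I_{B_n}(x).$$

Finally set $C_n = \bigl(\bigl(\bigcup_{i=1}^{n} A_i\bigr) \cup B_n\bigr)^c$ and define $\hat f_n(x)$ by the sum

$$\hat f_n(x) = \hat g_n(x) I_{C_n}(x) + r_n(x) + s_n(x).$$

Write $\|\psi\|_2$ for the weighted $L_2$ norm of $\psi$, which is given by

$$\|\psi\|_2 = \left(\int_{-\infty}^{\infty} \psi^2(x)\,\phi(x)\,dx\right)^{1/2}.$$

To prove $\hat f_n$ is consistent we apply Minkowski's inequality to obtain

$$\|\hat f_n - \phi\|_2 \le \|\hat g_n - \phi\|_2 + \|\hat g_n I_{C_n^c}\|_2 + \|r_n\|_2 + \|s_n\|_2. \tag{7}$$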


One easily sees that $\|\hat g_n - \phi\|_2$, $\|r_n\|_2$, and $\|s_n\|_2$ go a.s. to 0 as $n \to \infty$, so we consider the more subtle second term. By Schwarz' inequality,

$$\|\hat g_n I_{C_n^c}\|_2 \le \|\hat g_n\|_2\, \|I_{C_n^c}\|_2 \le \|\hat g_n\|_2 \left(\int_{B_n \cup (\bigcup_{i=1}^{n} A_i)} \phi(x)\,dx\right)^{1/2} \le \|\hat g_n\|_2 \left(\int_{v_n}^{v_n+1} \phi(x)\,dx + \int_{\bigcup_{i=1}^{n} A_i} \phi(x)\,dx\right)^{1/2}.$$

Since $\hat g_n$ is consistent, $\|\hat g_n\|_2$ is bounded. Further, since the measure of $\bigcup_{i=1}^{n} A_i$ is at most $2/n$ and by the boundedness of $\phi$ we see that

$$\int_{\bigcup_{i=1}^{n} A_i} \phi(x)\,dx$$

goes to 0 as $n \to \infty$. Finally, since $v_n \to \infty$ we have

$$\int_{v_n}^{v_n+1} \phi(x)\,dx \to 0$$

as $n \to \infty$; hence (7) shows $\hat f_n$ is in fact strongly consistent. To prove the key condition (5) we note that

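$$\mathrm{ASE}(\hat f_n, \phi) = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{(1 + n^{-2})\, e^{-X_i^2/2}}{\sqrt{2\pi}} - \frac{e^{-X_i^2/2}}{\sqrt{2\pi}}\right)^2 = \frac{1}{n} \sum_{i=1}^{n} \frac{e^{-X_i^2}}{2\pi n^4} < 1/n^3.$$

Next we have

$$\mathrm{ISE}(\hat f_n, \phi) = \int_{-\infty}^{\infty} \bigl(\hat f_n(x) - \phi(x)\bigr)^2 \phi(x)\,dx \ge \int_{B_n} \bigl(1 - \phi(x)\bigr)^2 \phi(x)\,dx. \tag{8}$$

Since $\bigcup_{i=1}^{n} A_i$ has measure at most $2/n$ we have

$$\mathrm{ISE}(\hat f_n, \phi) \ge \int_{v_n + 2/n}^{v_n+1} \bigl(1 - \phi(x)\bigr)^2 \phi(x)\,dx \ge \bigl(1 - \phi(v_n)\bigr)^2 \int_{v_n + 2/n}^{v_n+1} \phi(x)\,dx. \tag{9}$$

Finally, inequalities (6), (8), (9) show that the ratio of $\mathrm{ISE}(\hat f_n, \phi)$ to $\mathrm{ASE}(\hat f_n, \phi)$ increases like $n^2$ as $n$ tends to infinity. Q.E.D.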
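The orders of magnitude in the proof can be checked numerically. The sketch below rests on one reading of the construction, stated here as an assumption: the modified estimator equals $(1 + n^{-2})\phi(X_i)$ at each observation and equals 1 on $B_n$. It picks $v_n$ by bisection as the largest value for which (6) still holds (one admissible choice, also an assumption), then compares the lower bound (9) for the ISE with the bound $1/(2\pi n^4)$ that this reading gives for the ASE. All function names are hypothetical.

```python
import math

def phi(x):
    """Unit normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Unit normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def choose_v(n):
    """Bisection for the largest v >= 0 with Phi(v + 1) - Phi(v) >= 1/n (needs n >= 3)."""
    lo, hi = 0.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid + 1.0) - Phi(mid) >= 1.0 / n:
            lo = mid  # condition (6) still holds, so v can move to the right
        else:
            hi = mid
    return lo

def ratio_lower_bound(n):
    """Return v_n and a lower bound for ISE / ASE under the assumed construction."""
    v = choose_v(n)
    # Lower bound (9) for the ISE.
    ise_lower = (1.0 - phi(v)) ** 2 * (Phi(v + 1.0) - Phi(v + 2.0 / n))
    # Upper bound for the ASE: each term is phi(X_i)^2 / n^4 <= 1 / (2 pi n^4).
    ase_upper = 1.0 / (2.0 * math.pi * n ** 4)
    return v, ise_lower / ase_upper

for n in (10, 100, 1000):
    v, bound = ratio_lower_bound(n)
    print(n, round(v, 3), bound)
```

The bound grows quickly with `n`, in line with the divergence asserted in (5); the exact rate printed here depends on the assumed choice of `v_n`.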

Remarks. Much of the detail of this construction is caused by the necessity of making $\mathrm{ISE}(\hat f_n, \phi)$ nonzero and the desirability of making $\mathrm{ASE}(\hat f_n, \phi)$ also nonzero. The estimator $\hat f_n$ constructed here is not itself a density since it need not have integral 1. While it is not unusual to consider such estimators (e.g., orthogonal series methods give non-density density estimators), one should note that $\hat f_n$ can be adjusted easily to make it a proper density. Also, we note that $\hat f_n$ is not smooth, but it can be made smooth by routine modifications.
There is no claim that the example constructed above is "natural", but it serves well enough to pinpoint the possible extreme pathologies of the ASE. The more natural pathologies are taken up in the next section.

3. DOES THE ASE UNDERESTIMATE?

The fact that in some cases the ISE can be many times larger than the ASE might not deter a practical person's desire to use the ASE in assessing the quality of an estimator. On the other hand, if the ASE were seen to consistently underestimate the error of an estimator in very simple situations, then almost any application of the ASE would be dubious.
In order to detect this possibility we consider the new average sum of squared errors (NASE). If $\hat f_n$ is constructed on the basis of a sample $X_1, X_2, \ldots, X_n$ from a population with density $f$, then a second independent sample $X_1', X_2', \ldots, X_n'$ is drawn and we set

$$\mathrm{NASE}(f, \hat f_n) = \frac{1}{n} \sum_{i=1}^{n} \bigl(\hat f_n(X_i') - f(X_i')\bigr)^2.$$

We can now check that the motivations given to support the ASE's case for approximating ISE can be repeated verbatim on behalf of the NASE. First we write $\tilde F_n$ for the empirical distribution function of $X_1', X_2', \ldots, X_n'$. Since the $X_i'$ have density $f(x)$ we see $d\tilde F_n$ approximates $f(x)\,dx$ just as $dF_n$ did earlier. Since $\mathrm{NASE}(f, \hat f_n)$ is precisely equal to

$$\int \bigl(\hat f_n(x) - f(x)\bigr)^2\, d\tilde F_n(x),$$

the $\mathrm{NASE}(f, \hat f_n)$ is thus an estimate of $\mathrm{ISE}(f, \hat f_n)$ with just the same pedigree as the $\mathrm{ASE}(f, \hat f_n)$.
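The only difference between the two criteria is which sample the estimator is evaluated on, as the following sketch tries to make plain. The estimator (a unit-variance normal centred at the sample mean) and all names are illustrative assumptions, not the paper's.

```python
import math
import random

def normal_pdf(x, mean=0.0):
    """Density of the normal distribution with the given mean and unit variance."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def mean_squared_gap(f, f_hat, points):
    """Mean of (f_hat(x) - f(x))^2 over the given evaluation points."""
    return sum((f_hat(x) - f(x)) ** 2 for x in points) / len(points)

random.seed(2)
n = 20
train = [random.gauss(0.0, 1.0) for _ in range(n)]  # X_1, ..., X_n: builds the estimator
fresh = [random.gauss(0.0, 1.0) for _ in range(n)]  # X'_1, ..., X'_n: independent sample

x_bar = sum(train) / n
f_hat = lambda x: normal_pdf(x, mean=x_bar)  # illustrative estimator (assumption)

ase_value = mean_squared_gap(normal_pdf, f_hat, train)   # evaluated on the training data
nase_value = mean_squared_gap(normal_pdf, f_hat, fresh)  # evaluated on the fresh data
print(ase_value, nase_value)
```

Using one helper for both quantities mirrors the argument in the text: any a priori case made for the first applies equally to the second.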
The point to be made is that (a) if the a priori arguments in favour of NASE and ASE are the same and (b) if NASE and ASE differ significantly, then one must conclude that the a priori arguments do not constitute a significant motivation for the ASE.
It remains to be seen if the NASE and ASE are significantly different. To this end a modest Monte Carlo study was undertaken of the ratio

$$\mathrm{RHO}(f, \hat f_n) = \mathrm{NASE}(f, \hat f_n)/\mathrm{ASE}(f, \hat f_n). \tag{10}$$


In order not to confound the difficulties of interpreting Monte Carlo evaluations of (10), RHO was calculated for very simple estimators of an essentially parametric type. The cases considered were the following:
I. The unit normal was estimated by $\hat f_n(x) = \phi(x + \ldots)$.
II. The triangular density with base $[0,1]$ was estimated by $\hat f_n(x)$, the triangular density of base $[0, \ldots]$.
III. The rectangular density with base $[0,1]$ was estimated by $\hat f_n(x)$, the rectangular density with base $\bigl[0, \frac{n+1}{n} \max_{1 \le i \le n} X_i\bigr]$.
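TABLE 1: Numbers of Values of RHO out of 1000 for Three Different Density Estimators.

| Range of RHO | I Normal, n = 10 | I Normal, n = 20 | I Normal, n = 50 | II Triangular, n = 10 | II Triangular, n = 20 | II Triangular, n = 50 | III Rectangular, n = 10 | III Rectangular, n = 20 | III Rectangular, n = 50 |
|---|---|---|---|---|---|---|---|---|---|
| 0 - 1 | 36 | 1 | 0 | 55 | 11 | 0 | 1 | 0 | 0 |
| 1 - 2 | 85 | 15 | 0 | 138 | 71 | 7 | 532 | 512 | 506 |
| 2 - 3 | 121 | 45 | 2 | 127 | 116 | 44 | 44 | 5 | 0 |
| 3 - 4 | 110 | 60 | 8 | 86 | 107 | 54 | 33 | 17 | 1 |
| 4 - 5 | 83 | 56 | 8 | 85 | 81 | 77 | 34 | 16 | 1 |
| 5 - 6 | 77 | 62 | 11 | 60 | 75 | 67 | 31 | 15 | 3 |
| 6 - 7 | 54 | 56 | 27 | 46 | 55 | 69 | 24 | 19 | 3 |
| 7 - 8 | 39 | 49 | 37 | 26 | 29 | 45 | 23 | 21 | 9 |
| 8 - 9 | 36 | 37 | 19 | 12 | 32 | 44 | 12 | 24 | 4 |
| 9 - 10 | 26 | 38 | 30 | 4 | 31 | 42 | 13 | 9 | 4 |
| 10 - 11 | 27 | 32 | 23 | 2 | 19 | 34 | 11 | 19 | 8 |
| 11 - 12 | 20 | 31 | 29 | 0 | 29 | 27 | 12 | 15 | 8 |
| 12 - 13 | 18 | 40 | 23 | 1 | 21 | 28 | 13 | 17 | 6 |
| 13 - 14 | 16 | 21 | 32 | 0 | 9 | 36 | 8 | 13 | 7 |
| 14 - 15 | 13 | 14 | 14 | 5 | 6 | 16 | 10 | 4 | 4 |
| 15 - 16 | 10 | 19 | 31 | 2 | 5 | 9 | 14 | 6 | 7 |
| 16 - 17 | 12 | 22 | 13 | 1 | 2 | 7 | 7 | 8 | 3 |
| 17 - 18 | 9 | 11 | 20 | 2 | 0 | 13 | 8 | 13 | 9 |
| 18 - 19 | 12 | 17 | 17 | 0 | 0 | 3 | 3 | 10 | 8 |
| 19 - 20 | 12 | 15 | 18 | 3 | 0 | 10 | 8 | 9 | 11 |
| 20 - 30 | 61 | 102 | 154 | 30 | 6 | 126 | 34 | 56 | 53 |
| 30 - ∞ | 133 | 249 | 484 | 315 | 295 | 242 | 125 | 197 | 345 |
| Mean | 25 | 78 | 138 | 4700 | 55000 | 37000 | 34 | 89 | 118 |
| St.Dev. | 141 | 313 | 1100 | 41200 | 105000 | $10^6$ | 316 | 1140 | 548 |
| Largest | 4000 | 30000 | 25000 | 93000 | $29 \times 10^5$ | $29 \times 10^6$ | 7900 | 26000 | 9300 |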


For sample sizes n10, nm20, and n50 the value of RHO(f,?n) was calculated 1000 times in each of the cases. For example, from Table 1 we see that in 1000 calculations of RHD by procedure II with a sample of size n20 there were 29 times that RHO was between 7 and 8. Also, the mean, standard deviation and largest reported at the bottom of Table 1 refer to the set of all 1000 values generated for each column.
There are a number of conclusions which can be drawn from the calculations. The most naive but most basic is that RHO is frequently very large and consequently the ASE may frequently give a serious underestimation of the error inherent in estimating $f(x)$ by $\hat f_n(x)$. There are qualifications to be made and these are taken up in the Conclusion, but there is a lot of validity in the naive observation.

4. CONCLUSION

The Theorem of Section 2 gives a theoretical reason why the ASE might be very small compared to the ISE, and the simulations of Section 3 show how ASE can be very small compared to the even more natural measure NASE. The clear conclusion is that the ASE can seriously underestimate the errors of density estimation.
There are numerous possible criticisms of the procedures which have been applied here, yet none of these seems to seriously inveigh against the conclusion. One may observe that the example given in Section 2 is not natural since $\hat f_n$ was constructed precisely to have a small ASE when used to estimate $\phi$. This may indeed be trickery, but one should not be prepared to take as a standard a measure which can be so easily tricked. Moreover, as is always the case with counterexamples, once one has been produced it suggests the possibility of more natural ones existing all around us.
The criticisms of Section 3 are essentially the generic criticisms of any Monte Carlo study. Since all possible care was taken with the random number generation, one must conclude that the huge range of values of RHO given in Table 1 is a true reflection of the ratio of the NASE and ASE. Since only three estimation problems were considered it is possible that there are density estimation problems in which ISE, NASE, and ASE are all comparable. While it would be interesting to know if such estimators exist, it seems already a sufficient indictment of the ASE that it is shown deficient by the three simplest estimators.
Beyond the basic conclusions to be drawn from Table 1, the data given there suggest several problems. Can one prove that under most circumstances

$$\lim_{n \to \infty} \mathrm{NASE}(f, \hat f_n)/\mathrm{ASE}(f, \hat f_n) = \infty \quad \text{a.s.?}$$

This is hinted at by Table 1, and has been proved in some special cases, but it would be interesting to know how generally it holds.
The main problem suggested by the preceding analysis is naturally the following: Is there a numerically expedient measure which accurately reflects the error of a density estimation? The incentives which led to Wegman's original introduction of the ASE remain as valid as before, but the deficiencies of the ASE serve as an indication of the problems to be overcome.

RÉSUMÉ

The approximate squared error (ASE) was introduced as a good estimator of the integrated squared error (ISE). An example is presented in which the ratio of these two errors tends to infinity. This article also contains the results of a Monte Carlo study which suggests that the approximate squared error (ASE) underestimates the errors incurred, even in very simple cases.

REFERENCES

Fryer, M.J. (1977). A review of some nonparametric methods of density estimation. J. Inst. Math. Appl., 20, 335-354.
Wegman, E.J. (1972). Nonparametric probability density estimation: a comparison of density estimation methods. J. Statist. Comp. and Simulation, 1, 225-245.

Received 9 August 1978
Revised 11 October 1978

Department of Statistics
Stanford University
Stanford, California
94305 U.S.A.