J.M. STEELE
Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544, USA
W.L. STEIGER
Department of Computer Science, Rutgers University, New Brunswick, NJ 08903, USA
Received 20 March 1985
Revised 15 August 1985
Given $n$ points $\{(x_i, y_i)\}$ in the plane we study the problem of calculating the least median of squares regression line. This involves the study of the function $f(\alpha, \beta) = \mathrm{median}(|y_i - (\alpha + \beta x_i)|)$; it is piecewise linear and can have a quadratic number of local minima. Several algorithms that locate a minimizer of $f$ are presented. The best of these has time complexity $O(n^3)$ in the worst case. Our most practical algorithm appears to be one which has worst case behavior of $O(n^3 \log n)$, but we provide a probabilistic speedup of this algorithm which appears to have expected time complexity of $O((n \log n)^2)$.
1. Introduction
Given $n$ points $(x_i, y_i)$ in the plane our computational objective is to find a line which is optimal with respect to the criterion of minimizing the median of the squared residuals. Equivalently, because $|t|^2$ is monotone in $|t|$, we wish to find values $\alpha^*$ and $\beta^*$ which minimize the objective function

(1) $f(\alpha, \beta) = \mathrm{median}(|y_i - (\alpha + \beta x_i)|)$.
We follow Rousseeuw [5] in referring to the line $y = \alpha^* + \beta^* x$ as a least median of squares fit because of the compelling analogy with the familiar term, least (sums of) squares.
One reason for the interest in the least median of squares regression procedure is that the method has a high breakdown point. Roughly speaking, the breakdown point of a statistical estimator is the smallest percentage of contamination which may cause the estimator to take on arbitrarily large values. This concept was first studied explicitly by Hampel [2], who extended an earlier concept of Hodges [3]. It now appears that breakdown point is emerging as one of the basic criteria for judging the robustness of an estimator (see e.g. Donoho and Huber [1]). One virtue of least median of squares regression is that it has a breakdown point of 50%, and this is obviously the highest possible breakdown point of any reasonable estimator (see Rousseeuw [5], which also mentions an algorithm and cites Leroy and Rousseeuw [4]).
* Research supported in part by NSF grant DMS-8414069.
In the next section we characterize local minima of $f$ and thus give a necessary condition for global minimizers. This converts the problem of minimizing $f$ into a discrete optimization. In the course of studying the geometry of $f$, we will show that $f$ can have $O(n^2)$ local minima. This high number of local minima makes any approach with gradient methods (or even naive line search) a risky proposition, but the characterization of the local minima which we exhibit gives an easy route to exact enumerative minimization.
Two algorithms exploiting our characterization are given. The first of these has time complexity $O(n^3)$ and the second has complexity $O(n^3 \log n)$. The reason for putting forth the second algorithm is that in the last section it is modified to provide a probabilistic algorithm which may be the most practical one available. On the basis of modest simulation and a reasonably persuasive heuristic argument, it is conjectured to have complexity $O((n \log n)^2)$ on the average.
In all the considerations which follow, we suppose that the initial data $(x_i, y_i)$, $1 \le i \le n$, are in general position. More explicitly, we assume no three points are collinear, and no two pairs of points determine parallel lines. We also need to disambiguate the notion of median in even sample sizes. If $u_1 \le u_2 \le \cdots \le u_n$, our convention, based on convenience, will be to take $u_m$ as the median, where $m = 1 + \lfloor n/2 \rfloor$. This reduces to the middle value for $n$ odd and to the high median, or larger of the two middle values, if $n$ is even. One consequence of these conventions is that the minimum value of $f$ is strictly positive when $n$ is larger than 3.
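For concreteness, this median convention, and the objective $f$ of (1) built on it, fit in a few lines of Python (a minimal sketch; the function name is ours):

```python
# Objective (1) under the paper's median convention: for sorted values
# u_1 <= ... <= u_n the median is u_m with m = 1 + floor(n/2), i.e. the
# "high" median when n is even.

def lms_objective(points, alpha, beta):
    """f(alpha, beta) = median |y_i - (alpha + beta * x_i)|."""
    residuals = sorted(abs(y - (alpha + beta * x)) for x, y in points)
    m = 1 + len(points) // 2        # 1-based median index m(n)
    return residuals[m - 1]         # 0-based access
```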
2. Structure and algorithms
Given $\alpha$ and $\beta$, the line $l_{\alpha,\beta} = \{(x, y) : y = \alpha + \beta x\}$ defines residuals $r_i(\alpha, \beta) = y_i - (\alpha + \beta x_i)$. When there is no confusion we will simply write $r_i$. We say that $l_{\alpha,\beta}$ 'bisects' the three distinct points $(x_{i_j}, y_{i_j})$, $j = 1, 2, 3$, if the residuals $r_{i_j}$ are all the same size and not all have the same sign. If also $x_{i_1} < x_{i_2} < x_{i_3}$ and $r_{i_1} = -r_{i_2} = r_{i_3}$, we say that $l_{\alpha,\beta}$ equioscillates with respect to the points. It is easy to see that each triple of points is 'bisected' by three lines, one of which equioscillates.
We begin the study of the combinatorics of minimizing $f$ by noting that any line $l_{\alpha,\beta}$ partitions $\{1, \ldots, n\}$ into three sets,

$B_{\alpha,\beta} = \{i : |r_i(\alpha, \beta)| > f(\alpha, \beta)\}$,
$M_{\alpha,\beta} = \{i : |r_i(\alpha, \beta)| = f(\alpha, \beta)\}$,
$S_{\alpha,\beta} = \{i : |r_i(\alpha, \beta)| < f(\alpha, \beta)\}$,
of big, median, and small residuals, respectively. When there is no confusion we will drop the subscripts. The following simple result articulates a necessary and sufficient condition for local optimality.
Main Lemma. The pair $(\alpha^*, \beta^*)$ is a local minimum of $f$ if and only if the following conditions hold:
(i) $|M_{\alpha^*, \beta^*}| = 3$,
(ii) $l_{\alpha^*, \beta^*}$ equioscillates with respect to the points indexed by $M_{\alpha^*, \beta^*}$,
(iii) $|B_{\alpha^*, \beta^*}| = n - m(n)$, where $m(n) = 1 + \lfloor n/2 \rfloor$ denotes the median index.
Proof. Fix $\beta$ and define $r_i(\alpha)$ by $r_i(\alpha) = y_i - (\alpha + \beta x_i)$. Suppose that $|M| = 1$ and that $|r_p(\alpha)|$ is the median-sized residual. If $r_p(\alpha)$ is positive, $|r_p(\alpha)|$ decreases as $\alpha$ increases; and, if it is negative, as $\alpha$ decreases. As $\alpha$ changes, $|r_p(\alpha)|$ remains the median until that point $\alpha'$ at which another residual, $r_q$, first attains equal size. We may assume $r_p(\alpha') r_q(\alpha') < 0$. Otherwise, we could change $\alpha'$ and decrease both $|r_p(\alpha')|$ and $|r_q(\alpha')|$. Hence, at a local optimum we always have two median-sized residuals with opposite signs.
Now we show that (i) is necessary. Suppose $(\alpha, \beta)$ is a local optimum, and that there are points, say $p$ and $q$, whose residuals from $l_{\alpha,\beta}$ are median sized and of opposite sign. Note that $l_{\alpha,\beta}$ contains the point $\bar p = [(x_p, y_p) + (x_q, y_q)]/2$. If no other data point has residual size $|r_p|$, $l_{\alpha,\beta}$ may be rotated about $\bar p$ so that both $|r_p|$ and $|r_q|$ decrease (equally). They will remain median sized residuals up to that point when another residual, say $r_s$, first attains equal size. This proves that (i) is necessary and also that a line defined by a local optimum will 'bisect' the points indexed by $M$.
In fact, if $(\alpha, \beta)$ is a local optimum, $l_{\alpha,\beta}$ must equioscillate. If not, it already passes through the midpoint of two 'bisected' points whose residuals have opposite sign. By rotating about this midpoint, all three median sized residuals may be reduced in size. Since at least one of these residuals must remain median sized, (ii) is necessary for a local minimum.
To see that (iii) is necessary, suppose that $(\alpha, \beta)$ is a local minimizer of $f$. Then (i) and (ii) hold, so there are $p$, $q$, and $t \in M$ with $x_p < x_q < x_t$ and $r_p = -r_q = r_t$. The line $l_{\alpha,\beta}$ may be rotated about the midpoint of $(x_p, y_p)$ and $(x_q, y_q)$ so that both $|r_p|$ and $|r_q|$ decrease while $|r_t|$ increases. Let $(\alpha', \beta')$ denote the parameters of the rotated line. By the continuity of $y_i - (\alpha + \beta x_i)$, if $(\alpha', \beta')$ is close enough to $(\alpha, \beta)$, then $|r_p(\alpha', \beta')|$, $|r_q(\alpha', \beta')|$, and $|r_t(\alpha', \beta')|$ are smaller than all $|r_i(\alpha', \beta')|$, $i \in B_{\alpha,\beta}$, and bigger than all $|r_i(\alpha', \beta')|$, $i \in S_{\alpha,\beta}$. Since $f(\alpha, \beta)$ is a local minimum, such a rotation must not decrease the median-sized residual; this forces the median rank $m(n)$ to fall on the largest of the three equal residuals, which is exactly the statement $|B_{\alpha,\beta}| = n - m(n)$. Conversely, when (i)-(iii) hold, no small perturbation of $(\alpha, \beta)$ can decrease all three equal residuals, and by (iii) their largest is the median; hence $f$ cannot decrease, and $(\alpha, \beta)$ is a local minimum. □
Besides characterizing local minimizers of $f$, the lemma turns the problem into a discrete one. Only lines that equioscillate with respect to some triple of data points need be considered. The following algorithm checks $f(\alpha, \beta)$ on such a line for each triple. By the lemma, the equioscillating line with the smallest value of $f$ will determine $(\alpha^*, \beta^*)$.
Crude Algorithm
• $d^* \leftarrow \infty$
• $(\alpha^*, \beta^*) \leftarrow (0, 0)$
• for each distinct triple $i, j, k$:
  – renumber the points so that $x_i < x_j < x_k$, and let $(\alpha, \beta)$ be the parameters of the line that equioscillates with respect to the triple
  – $d_{i,j,k} \leftarrow f(\alpha, \beta)$
  – if $d_{i,j,k} < d^*$: $d^* \leftarrow d_{i,j,k}$, $(\alpha^*, \beta^*) \leftarrow (\alpha, \beta)$
This may be the first finite, exact algorithm for least median of squares fits. Its crudeness is reflected in its complexity of $O(n^4)$, even if one uses a linear cost method to obtain $\mathrm{median}(|y_i - (\alpha + \beta x_i)|)$ for each of the $n(n-1)(n-2)/6$ lines mentioned. It would not be practical for even as few as 50 points.
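To make the enumeration concrete, here is a short Python sketch of the crude algorithm (function names are ours). The equioscillating line for a triple is parallel to the chord through the two outer points and passes halfway between that chord and the middle point, which is what setting $r_1 = -r_2 = r_3$ forces.

```python
# Runnable sketch of the Crude Algorithm; O(n^4) with the naive median.

from itertools import combinations

def lms_objective(points, alpha, beta):
    residuals = sorted(abs(y - (alpha + beta * x)) for x, y in points)
    return residuals[len(points) // 2]          # high median (1-based index 1 + n//2)

def crude_lms(points):
    best_d, best_ab = float('inf'), (0.0, 0.0)
    for triple in combinations(points, 3):
        (x1, y1), (x2, y2), (x3, y3) = sorted(triple)   # renumber so x1 < x2 < x3
        beta = (y3 - y1) / (x3 - x1)                    # slope of the outer chord
        alpha = (y1 + y2 - beta * (x1 + x2)) / 2        # r1 = -r2 pins the intercept
        d = lms_objective(points, alpha, beta)
        if d < best_d:
            best_d, best_ab = d, (alpha, beta)
    return best_d, best_ab
```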
This crude algorithm nevertheless contains the germ of a useful method with a faster running time. The first step is to focus on lines through pairs of points. Let $\beta_{i,j} = (y_i - y_j)/(x_i - x_j)$. Because equioscillation is necessary, the optimal $\beta^*$ must equal $\beta_{i,j}$ for some $i \ne j$. So regard $\beta$ in (1) as fixed, and consider the reduced problem of finding a minimizer of
(2) $g(\alpha) = \mathrm{median}(|z_i - \alpha|)$,
where $z_i = y_i - \beta x_i$. If the $z_i$ are in increasing order, then a little thought convinces one that the minimum value of $g(\alpha)$ is equal to half the length of the smallest interval that contains half of the $z_i$. The minimizer $\alpha^*$ is then the midpoint of this interval. Thus, writing $m(n)$ for the median of $\{1, \ldots, n\}$,

(3) if $z_{p+m(n)-1} - z_p = \min\{z_{j+m(n)-1} - z_j : j = 1, \ldots, n - m(n) + 1\}$, then
$\min_\alpha g(\alpha) = (z_{p+m(n)-1} - z_p)/2$ and $\alpha^* = (z_{p+m(n)-1} + z_p)/2$.
This formula can also be derived by noting that the graph of $g$ is continuous and piecewise linear. Each piece has slope $\pm 1$, the slopes change only where the identity of the median-sized residual changes, $g(z_1) = z_{m(n)} - z_1$, and $g' < 0$ for $\alpha < z_1$ while $g' > 0$ for $\alpha > z_n$; the minimum is therefore attained at a breakpoint, the midpoint of two of the $z_i$.
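The reduced problem (2)-(3) is a one-dimensional computation. Below is a minimal Python sketch of it under the assumptions above (the name best_intercept is ours): after sorting, it scans the $n - m(n) + 1$ windows of $m(n)$ consecutive $z$'s.

```python
# Formula (3): the optimal intercept for a fixed slope is the midpoint of
# the shortest window of m(n) consecutive sorted z's; min g is half its length.

def best_intercept(z):
    z = sorted(z)
    n = len(z)
    m = 1 + n // 2                                        # median index m(n)
    p = min(range(n - m + 1), key=lambda j: z[j + m - 1] - z[j])
    return (z[p + m - 1] - z[p]) / 2, (z[p + m - 1] + z[p]) / 2   # (min g, alpha*)
```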
Algorithm 1
• $(\alpha^*, \beta^*) \leftarrow (0, 0)$
• $d^* \leftarrow \infty$
• for each distinct pair $r, s$:
  – $\beta \leftarrow \beta_{r,s}$
  – $z_i \leftarrow y_i - \beta x_i$, $i = 1, \ldots, n$
  – sort the $z_i$
  – find $p$ with $z_{p+m(n)-1} - z_p = \min_j (z_{j+m(n)-1} - z_j)$; $d \leftarrow (z_{p+m(n)-1} - z_p)/2$
  – $\alpha \leftarrow (z_{p+m(n)-1} + z_p)/2$
  – if $d < d^*$: $d^* \leftarrow d$, $(\alpha^*, \beta^*) \leftarrow (\alpha, \beta)$
The complexity within the loop is $O(n \log n)$ due to the sort, and therefore Algorithm 1 has complexity $O(n^3 \log n)$ overall. This algorithm was used by Leroy and Rousseeuw [4] for small $n$, although it was not known to be exact. For larger $n$ they obtain an approximation to $(\alpha^*, \beta^*)$ by modifying Algorithm 1 to apply the loop to random choices of $r$ and $s$, rather than to all pairs.
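A direct Python rendering of Algorithm 1 (a sketch under the same assumptions; it inlines the window scan of (3)):

```python
# Algorithm 1: try each pairwise slope beta_{r,s}; for each, solve the
# reduced intercept problem on the transformed values z_i = y_i - beta*x_i.
# Sorting dominates the loop body, so the total cost is O(n^3 log n).

from itertools import combinations

def algorithm1(points):
    n = len(points)
    m = 1 + n // 2                                    # median index m(n)
    best = (float('inf'), 0.0, 0.0)                   # (d*, alpha*, beta*)
    for (xr, yr), (xs, ys) in combinations(points, 2):
        beta = (yr - ys) / (xr - xs)                  # candidate slope beta_{r,s}
        z = sorted(y - beta * x for x, y in points)
        p = min(range(n - m + 1), key=lambda j: z[j + m - 1] - z[j])
        d = (z[p + m - 1] - z[p]) / 2
        if d < best[0]:
            best = (d, (z[p + m - 1] + z[p]) / 2, beta)
    return best
```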
To go beyond Algorithm 1, it is necessary to explicate more of the geometry than was brought out in the Main Lemma. In condition (ii), let us weaken equioscillation to the requirement that $l_{\alpha,\beta}$ only 'bisect' the points indexed by $M$. If $(\alpha, \beta)$ also satisfies conditions (i) and (iii), we say it is a 'possible local minimum'. The next result identifies and counts the 'possible local minima'.
Little Lemma. There are exactly $n(n-1)/2$ choices of $(\alpha, \beta)$ such that $l_{\alpha,\beta}$ 'bisects' the points in $M_{\alpha,\beta}$ and the conditions (i) and (iii) hold.
Proof. Consider the line $L_{jk}$ given by

$y = \beta_{j,k}(x - x_j) + y_j$

and which contains points $j$ and $k$. There are $n - 2$ nonzero residuals from this line. Either at least $m = \lfloor (n-2)/2 \rfloor$ are positive, or $m$ are negative. Assume there are $m$ positive ones (the negative case can be handled similarly). The first step is to find $p$ such that $r_p$ is the $m$-th smallest positive residual from $L_{jk}$.

Now consider the line $L'_{jk}$ parallel to $L_{jk}$ and containing $(x_p, y_p)$. There are three points on $L_{jk}$ and $L'_{jk}$, $m - 1$ points between the lines, and $q = n - m - 2$ points not between them. Consider the line $L^*_{jk}$ which passes halfway between $L_{jk}$ and $L'_{jk}$. The construction assures that it 'bisects' points $j$, $k$, and $p$, that $j, k, p$ are in $M$, and that (i) and (iii) hold. □
It is natural to ask how many of the 'possible local minima' are, in fact, local minima. The answer depends on the particular configuration of data points. Indeed, for the 'bisecting' line $L^*_{jk}$ to equioscillate, all that is required is that $x_p$ lie between $x_j$ and $x_k$. Using this observation, we exhibit a configuration where $f$ has a quadratic number of local minima.
Let $A$ denote the $\lfloor n/4 \rfloor$ points with the smallest $x$-coordinates and $B$ the $\lfloor n/4 \rfloor$ points with the largest $x$-coordinates. The remaining points have 'middle half' $x$-coordinates. They will be assigned $y$-coordinates larger than those of any of the points in $A$ or $B$. Now, if point $j \in A$ and point $k \in B$ are used to define $L_{jk}$, as in the proof of the Little Lemma, the point $p$ will not be in $A$ or $B$. The line $L^*_{jk}$ will equioscillate and therefore define a local minimum of $f$. There are $n^2/16$ such lines, one for each pair $j, k$ in $A$ and $B$, respectively. This proves the interesting
Corollary. $f$ can have $O(n^2)$ local minima.
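A configuration of the kind used in the proof is easy to generate; the following Python sketch is one concrete choice (the numerical ranges and the function name are ours, and any values with the stated separation would do):

```python
# Quarter of the points far left (A), quarter far right (B), and the middle
# half lifted to much larger y-values, as in the corollary's construction.

import random

def quadratic_minima_example(n, lift=100.0):
    quarter = n // 4
    pts  = [(random.uniform(0.0, 1.0), random.uniform(0.0, 1.0))
            for _ in range(quarter)]                    # block A: small x
    pts += [(random.uniform(9.0, 10.0), random.uniform(0.0, 1.0))
            for _ in range(quarter)]                    # block B: large x
    pts += [(random.uniform(4.0, 6.0), lift + random.uniform(0.0, 1.0))
            for _ in range(n - 2 * quarter)]            # middle half: high y
    return pts
```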
The proof of the Little Lemma suggests another algorithm. Given $j$ and $k$, compute the line $L^*_{jk}$ given by

$y = \beta_{j,k} x + [y_j + y_p - \beta_{j,k}(x_j + x_p)]/2$.

The number $p$ corresponds to the $m$-th least positive (or negative) residual from the line through points $j$ and $k$, and may be obtained in linear time. For some pair $j, k$, one of these lines will define the minimizer in (1). It may be found in time $O(n^3)$.
Algorithm 2
• $(\alpha^*, \beta^*) \leftarrow (0, 0)$
• $d^* \leftarrow \infty$
• $m \leftarrow \lfloor (n-2)/2 \rfloor$
• for each distinct pair $r, s$:
  – $\beta \leftarrow \beta_{r,s}$
  – $z_i \leftarrow y_i - y_r - \beta(x_i - x_r)$, $i = 1, \ldots, n$
  – $P \leftarrow \{i : z_i > 0\}$, $N \leftarrow \{i : z_i < 0\}$
  – select $p$, the index of the $m$-th smallest $z_i$ in $P$ if $|P| \ge m$, and otherwise of the $m$-th largest $z_i$ in $N$
  – $\alpha \leftarrow [y_r + y_p - \beta(x_r + x_p)]/2$
  – if $|z_p|/2 < d^*$: $d^* \leftarrow |z_p|/2$, $(\alpha^*, \beta^*) \leftarrow (\alpha, \beta)$
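A Python sketch of Algorithm 2 follows (names ours). For brevity it selects the $m$-th same-sign residual with heapq.nsmallest, which costs $O(n \log n)$ per pair; replacing that with a true linear-time selection gives the $O(n^3)$ bound.

```python
# Algorithm 2: for each pair (r, s), place a line halfway between the chord
# through r, s and the parallel line through the point p whose residual is
# the m-th smallest among those of one sign; f on that line equals |z_p|/2.

from itertools import combinations
import heapq

def algorithm2(points):
    n = len(points)
    m = (n - 2) // 2
    best = (float('inf'), 0.0, 0.0)                  # (d*, alpha*, beta*)
    if m == 0:
        return best
    for (xr, yr), (xs, ys) in combinations(points, 2):
        beta = (yr - ys) / (xr - xs)                 # slope beta_{r,s}
        z = [y - yr - beta * (x - xr) for x, y in points]
        pos = [v for v in z if v > 0]
        neg = [-v for v in z if v < 0]
        side = pos if len(pos) >= m else neg
        zp = heapq.nsmallest(m, side)[-1]            # m-th smallest |z_i| on that side
        if zp / 2 < best[0]:
            p = next(i for i, v in enumerate(z) if abs(v) == zp)
            xp, yp = points[p]
            alpha = (yr + yp - beta * (xr + xp)) / 2
            best = (zp / 2, alpha, beta)
    return best
```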
3. Deterministic and random improvements
One can improve the practical performance of Algorithm 1 by applying what we call the wedge trick. Algorithm 1 loops over all pairs r and s but, as we shall see, the cost of many of these loops can be reduced from n log(n) to log(n). On the basis of our experience, this reduction is possible often enough to make a big difference in practice.
When examining $r, s$ in the inner loop of Algorithm 1, we find the optimal intercept for lines of slope $\beta_{r,s}$. This defines $\alpha^*$ satisfying

(4) $f(\alpha^*, \beta_{r,s}) = \min_\alpha f(\alpha, \beta_{r,s})$.

Equation (3) implies that $|M| \ge 2$ and that there are $p, q \in M$ such that $r_p = -r_q$, so the line $l_{\alpha^*, \beta_{r,s}}$ passes through the midpoint $\bar p$ of points $p$ and $q$. Rotate $l_{\alpha^*, \beta_{r,s}}$ clockwise about $\bar p$ until a point, say $u$, first satisfies $|r_u| = |r_p|$, and let the slope of this line be $\beta_L$. Now rotate $l_{\alpha^*, \beta_{r,s}}$ counterclockwise about $\bar p$ until a point, say $t$, first satisfies $|r_t| = |r_p|$, and denote the slope of this line by $\beta_U$. This defines a wedge of lines through $\bar p$ with slopes in the interval $W = [\beta_L, \beta_U]$ for which both $p$ and $q$ remain in $M$. Therefore for any $\beta \in W$, the optimal line with slope $\beta$ is in the wedge. This observation allows one to eliminate from further consideration those $r, s$ for which $\beta_{r,s} \in W$.
The modification to Algorithm 1 (call it Algorithm W) is to compute $W = [\beta_L, \beta_U]$, take $(\alpha, \beta)$ to be the parameters of the better of the two lines defining the wedge (e.g., if $|r_u| < |r_t|$, take $\beta = \beta_L$), and then eliminate all $r, s$ for which $\beta_{r,s} \in W$. The wedge may be computed in linear time simply by solving $n - 2$ simple equations that require $|r_i|$ to be equal to $|r_p|$, $i$ different from $p$ or $q$. To facilitate the elimination of some $\beta_{r,s}$, a preprocessing step could compute all the $\beta_{r,s}$ and sort them in time $O(n^2 \log n)$. Then, given $W$, those $\beta_{r,s} \in W$ may be obtained (say by binary search) in time $O(\log n)$ and eliminated from further consideration. In the sorted list of $\beta_{r,s}$ we could maintain flags for those slopes which have been eliminated. Given $r$ and $s$, a search requiring $O(\log n)$ time reveals whether $\beta_{r,s}$ is flagged. If not, we must process this slope. The optimal intercept for $\beta_{r,s}$ and the wedge generated by $\beta_{r,s}$ may be obtained in time $O(n \log n)$. Therefore the total cost of Algorithm W is
$K \cdot O(n \log n) + (n^2 - K) \cdot O(\log n) + O(n^2 \log n)$,
where K denotes the total number of r, s pairs that needed actual processing.
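The wedge computation itself reduces to solving those $n - 2$ linear equations in the slope. The sketch below is our formulation: lines through the midpoint $\bar p$ with slope $b$ give point $i$ the residual $r_i(b) = y_i - \bar y - b(x_i - \bar x)$, and setting $r_i(b) = \pm r_p(b)$ yields the candidate slopes that bound $W$.

```python
# Wedge trick: the nearest slopes on either side of beta at which some other
# point's residual catches up with |r_p| bound the interval W = [beta_L, beta_U].

def wedge(points, beta, p, q):
    (xp, yp), (xq, yq) = points[p], points[q]
    xbar, ybar = (xp + xq) / 2, (yp + yq) / 2            # rotation centre
    lo, hi = float('-inf'), float('inf')
    for i, (xi, yi) in enumerate(points):
        if i in (p, q):
            continue
        candidates = (((yi - yp), (xi - xp)),                        # r_i(b) = r_p(b)
                      ((yi + yp - 2 * ybar), (xi + xp - 2 * xbar)))  # r_i(b) = -r_p(b)
        for num, den in candidates:
            if den != 0:
                b = num / den
                if b < beta:
                    lo = max(lo, b)                      # closest crossing below beta
                elif b > beta:
                    hi = min(hi, b)                      # closest crossing above beta
    return lo, hi
```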
Monte Carlo studies of Algorithm W were performed for various models governing random distributions of points and for various values of $n$ ranging from 10 to 125, each combination replicated several times. A rough exploratory analysis of these experiments seems to suggest that $O(n \log n)$ is a typical value for $K$. Although there is no thoroughly defensible model for a 'random' fitting problem, we make the following concrete statement.
Conjecture 1. If $(X_i, Y_i)$, $1 \le i \le n$, are independent observations from any bivariate distribution with continuous density, then Algorithm W has expected time complexity $O((n \log n)^2)$.
The probability theory involved in this conjecture appears to be nontrivial. On the practical side, two points are in order:
* Algorithm W has space complexity of $O(n^2)$. This means that examples of size 1000 become unrealistic without use of external storage.
* Our experience with data sets of size less than 120 shows that $(\alpha^*, \beta^*)$ can be computed rapidly. John Tukey suggested that a useful approximation to the least median of squares fit for 1000 points could be defined by averaging fits obtained on random subsets of size about 100.
The final algorithm, Algorithm R, randomizes the wedge trick. In preprocessing,
slopes $\beta_{r,s}$ are computed and sorted. Then, instead of checking the $\beta_{r,s}$ in order, one is selected at random. If it has not been eliminated already, the wedge $W = [\beta_L, \beta_U]$ generated by $\beta_{r,s}$ is computed. Then all $\beta_{r,s} \in W$ are eliminated and the process is repeated until all $\beta_{r,s}$ have been eliminated or checked.
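The bookkeeping for this randomized elimination can be organized as below (a sketch; the data structures and the callable wedge_for are our choices). It returns the number of slopes actually processed, the quantity $K$ discussed above.

```python
# Algorithm R's schedule: visit the sorted candidate slopes in random order,
# and after each processed slope flag the contiguous run inside its wedge.

import bisect
import random

def algorithm_r_schedule(slopes, wedge_for):
    """slopes: sorted candidate slopes beta_{r,s};
    wedge_for(beta) -> (beta_L, beta_U), the wedge around beta."""
    alive = [True] * len(slopes)
    order = list(range(len(slopes)))
    random.shuffle(order)
    checked = 0
    for k in order:
        if not alive[k]:
            continue                                  # already eliminated
        checked += 1                                  # this slope needs processing
        lo, hi = wedge_for(slopes[k])
        for j in range(bisect.bisect_left(slopes, lo),
                       bisect.bisect_right(slopes, hi)):
            alive[j] = False                          # eliminate all slopes in W
    return checked                                    # the quantity K
```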
Algorithm R appears to perform well. We have no provable complexity results to assert, e.g., that it is superior to Algorithm 1. However, we have put together several heuristic arguments that can be construed as support for the following statement.
Conjecture 2. For Algorithm R, the expected value of $K$ is $O(n \log n)$.
We know much less about the general problem. If the $(x_i, y_i)$ are now $n$ points in $\mathbb{R}^{k+1}$, the least median of squares objective function is

$\mathrm{median}(|y_i - (a_0 + a_1 x_{i1} + \cdots + a_k x_{ik})|)$.

Local minimizers must be the normal vectors to hyperplanes that 'bisect' $k + 2$ points with median sized residuals, a conjecture Jeff Salowe established by induction on $k$ and using the Main Lemma. An analogue of Algorithm 1 has complexity $O(n^{k+1} \log n)$.
References
[1] D.L. Donoho and P.J. Huber, The notion of breakdown point, in: P.J. Bickel, K. Doksum and J.L. Hodges, eds., A Festschrift for Erich L. Lehmann (Wadsworth, Belmont, CA, 1983) 157-184.
[2] F.R. Hampel, A general qualitative definition of robustness, Ann. Math. Statist. 42 (1971) 1887-1896.
[3] J.L. Hodges, Efficiency in normal samples and tolerance of extreme values of location, Proc. Fifth Berkeley Symposium on Mathematical Statistics and Probability (1967) 163-168.
[4] A. Leroy and P.J. Rousseeuw, PROGRES: A program for robust regression, Report 201, Centrum voor Statistiek en Operationeel Onderzoek, Univ. of Brussels, 1984.
[5] P.J. Rousseeuw, Least median of squares regression, J. Amer. Statist. Assoc. 79 (1984) 871-880.