DATA ANALYTIC TOOLS FOR CHOOSING TRANSFORMATIONS IN SIMPLE LINEAR REGRESSION

Richard D. De Veaux and J. Michael Steele, Princeton University

Program in Statistics and Operations Research, School of Engineering and Applied Science, Princeton, NJ 08544

ABSTRACT

Transformations of the regressor and/or the response in simple regression are often sought to increase linear association and to make residuals appear more nearly normally distributed with constant variance. The ACE (alternating conditional expectation) algorithm of Breiman and Friedman (1985) finds the transformations maximizing the correlation between the regressor and response, while the AVAS (additivity and variance stabilization) algorithm of Tibshirani (1988) uses a variance-stabilizing transformation of the response. An exploratory data tool, the bulging rule of Mosteller and Tukey (1977), is used to find specific functional forms for the relationships suggested by the ACE and AVAS algorithms. Data on the water content of soil are used to illustrate the procedure.

1. Introduction.

Recently, two powerful methods for estimating optimal transformations for regression and correlation have been proposed. For data $(x_i, y_i)$, $1 \le i \le n$, the ACE algorithm of Breiman and Friedman finds transformations $f$ and $g$ such that the empirical correlation of the transformed data $(f(x_i), g(y_i))$, $1 \le i \le n$, is approximately maximized. The term ACE is an acronym for alternating conditional expectation. If $(X, Y)$ is a pair of jointly distributed random variables, one can define $f$ and $g$ as the limits of the functions $f_n$ and $g_n$ determined by taking $f_0(x) = x$, $g_0(y) = y$ and applying the recursions

$$ f_{n+1}(X) = E(g_n(Y) \mid X) $$

and

$$ g_{n+1}(Y) = E(f_n(X) \mid Y) \,/\, [\operatorname{var}(E(f_n(X) \mid Y))]^{1/2}. $$

The $f$ and $g$ determined by this process can be shown to maximize $\operatorname{corr}(f(X), g(Y))$ (subject to $\operatorname{var}(g(Y)) = 1$). One cannot find $f$ and $g$ precisely by this method, but one can derive an empirically based algorithm as a natural modification of the theoretical algorithm, by replacing the conditional expectations by scatterplot smoothers. In their implementation, Breiman and Friedman used a refined version of the supersmoother of Friedman and Stuetzle (1982).

One problem of ACE is that the transformations, while maximizing linear association, may introduce heteroscedasticity in the response. The AVAS algorithm of Tibshirani (1988) is designed to alleviate this problem. It is similar to the ACE algorithm except that instead of using $g_{n+1}(y) = E(f_n(x) \mid y) / [\operatorname{var}(E(f_n(x) \mid y))]^{1/2}$ it uses the asymptotic variance-stabilizing transformation. (The details of the algorithm can be found in Tibshirani (1988).)

The ACE and AVAS algorithms can be useful as standalone tools for descriptive purposes. The result of each algorithm is two estimated functions $\hat f(x_i)$ and $\hat g(y_i)$, $1 \le i \le n$. However, it is often desirable or even necessary to obtain specific functional forms for $f$ and $g$, that is, to find functions which approximate $f$ and $g$ and retain the desirable properties of the ACE or AVAS transformations.

The purpose of this paper is to illustrate the use of the bulging rule of Mosteller and Tukey (1977) as an aid in finding an explicit functional form approximating the ACE and AVAS transformations. Additionally, we compare the transformations suggested by these algorithms. Data from an experiment on soil water diffusivity are used as an example. The parameter of interest, the diffusion coefficient, is a product of two functionals (a derivative and an integral) of $X$ and $Y$. By finding an explicit additive functional form $F(Y) = a + bG(X)$ we are able to calculate the functionals explicitly.

The article is organized as follows. The diffusion problem is more extensively described in Section 2. The procedure for finding the transformations is outlined in Section 3. In Section 4 we carry out the procedure on the experimental data in detail. Section 5 contains discussion and some concluding remarks.

2. Soil Water Diffusion Problem

The movement of water in a horizontal column of unsaturated soil is commonly modeled by means of the one-dimensional diffusion equation

$$ \frac{\partial \theta}{\partial t} = \frac{\partial}{\partial x}\left[ D(\theta)\,\frac{\partial \theta}{\partial x} \right] \tag{2.1} $$

where $\theta$ is the water content of the soil, $t$ is time, $x$ is the position in the horizontal column, and $D(\theta)$ is the coefficient of soil water diffusivity at the moisture level $\theta$. Any two-variable function $\theta(x, t)$ that satisfies (2.1) can be shown (cf. Jost (1960, p. 31)) to be a function of the single variable $\lambda = x t^{-1/2}$, which is often called the Boltzmann variable. After writing $\theta(\lambda)$ for the new function of one variable, one can check that $\theta(\lambda)$ satisfies the ordinary differential equation:

$$ -\frac{\lambda}{2}\,\frac{d\theta}{d\lambda} = \frac{d}{d\lambda}\left[ D(\theta)\,\frac{d\theta}{d\lambda} \right] \tag{2.2} $$

From this equation one can then easily obtain the expression for $D(\theta)$ which underlies our approach to its estimation:

$$ D(\theta) = -\frac{1}{2}\,\frac{d\lambda}{d\theta} \int_{\theta_0}^{\theta} \lambda(u)\,du, \tag{2.3} $$

where $\theta_0$ is the initial water content of the soil. The simultaneous appearance of both derivative and integral terms in this expression for $D(\theta)$ provides one of the most intriguing features of its estimation.

The process that has been most widely used to estimate $D(\theta)$ is the transient-flow experiment of Bruce and Klute (1956). In that experiment, water is held at a constant head and permitted to infiltrate into a horizontal column containing air-dry soil. After a fixed time interval, the column is sectioned, and the water content of the individual sections is determined either by weighing or by other methods. The data of Clothier and Scotter (1982) on Manawatu sandy loam plotted in Figure 1 are typical of those obtained through horizontal infiltration experiments. They also give an indication of some of the inherent difficulties in estimating $D(\theta)$. For instance, many smoothing methods when applied to the data of Figure 1 would lead to a virtually useless estimate of the derivative of $\lambda$ with respect to $\theta$. We will return to the problem of estimating $D(\theta)$ in Section 4. For further details on the experiment and historical background the reader is referred to De Veaux and Steele (1989) and Clothier and Scotter (1982).

Figure 1. Scatterplot of the volumetric water content, $\theta$, versus the Boltzmann variable, $\lambda$.

3. Estimation Process

The details of the method we propose are possibly best explained in the context of an example such as the analysis of $D(\theta)$ of Manawatu sandy loam. Moreover, one almost has to have an honest example in hand in order to detail the role of the tools we have used to assist our transformation choice: the bulging rule and the ACE and AVAS algorithms. With that said, it seems useful to have a top-down view of the proposed method. The four basic steps are the following:

Step 1. Find estimated transformations $\hat f(\theta)$ and $\hat g(\lambda)$ from the ACE and AVAS algorithms. (We use $\lambda$ for the regressor and $\theta$ for the response.) The transformed data values in both cases will exhibit a strong linear association. The $R^2$ values from the regressions of $\hat f(\theta_i)$ on $\hat g(\lambda_i)$ will be used as benchmarks against which we will compare the $R^2$ from more analytically tractable transformations.

Step 2. Use the so-called bulging rule of Mosteller and Tukey (1977) to suggest analytically tractable functions $F(\theta)$ and $G(\lambda)$ which retain the desirable properties of the functions found in Step 1.

Step 3. Perform the regression of $F(\theta)$ on $G(\lambda)$, and use diagnostic tools to assess the appropriateness of the linear model:

$$ F(\theta) = a + bG(\lambda) + \varepsilon. \tag{3.1} $$

Step 4. If functionals of $\theta$ or $\lambda$ need to be estimated, one can use (3.1) directly to obtain estimates.

As an example of Step 4, consider the case of the diffusion equation (2.1). After performing Steps 1-3, one would use the chain rule to extract from (3.1) an expression for $d\lambda/d\theta$ in terms of $\theta$. Then either analytic or numerical integration is used to determine the values of the definite integrals:

$$ I(\theta) = \int_{\theta_0}^{\theta} \lambda(u)\,du \tag{3.2} $$

for all $\theta \ge \theta_0$ in the observed range. Finally the diffusion coefficient $D(\theta)$ is estimated by the expression

$$ D(\theta) = -\frac{1}{2}\,\frac{d\lambda}{d\theta} \int_{\theta_0}^{\theta} \lambda(u)\,du \tag{3.3} $$

where the indicated derivative and integral are those determined previously.
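As a rough illustration of Step 4, the following Python sketch evaluates the diffusivity expression numerically for any fitted curve $\lambda(\theta)$. It assumes the usual sign convention $D(\theta) = -\tfrac{1}{2}\,(d\lambda/d\theta)\int_{\theta_0}^{\theta}\lambda(u)\,du$. The function name `diffusivity`, the central-difference derivative, the trapezoid rule, and the linear test curve with constants `A`, `B`, `THETA0` are all illustrative assumptions, not part of the method described here; the linear curve is used only because its answer can be checked by hand.

```python
def diffusivity(lam, theta, theta0, n=2000, h=1e-6):
    """Approximate D(theta) = -(1/2) * dlam/dtheta * integral_{theta0}^{theta} lam(u) du."""
    # Central-difference approximation of d(lambda)/d(theta) at theta.
    dlam = (lam(theta + h) - lam(theta - h)) / (2.0 * h)
    # Composite trapezoid rule for the definite integral of lam from theta0 to theta.
    step = (theta - theta0) / n
    total = 0.5 * (lam(theta0) + lam(theta))
    for k in range(1, n):
        total += lam(theta0 + k * step)
    integral = total * step
    return -0.5 * dlam * integral


# Hypothetical fitted relationship theta = A + B * lambda (F and G both the identity).
# These constants are made up for the check below; they are not estimates from the soil data.
A, B = 0.30, -0.05
THETA0 = 0.05


def lam_linear(theta):
    """Inverse of the hypothetical linear fit: lambda as a function of theta."""
    return (theta - A) / B


# By hand: dlam/dtheta = 1/B = -20 and the integral from 0.05 to 0.20 equals 0.525,
# so D(0.20) = -0.5 * (-20) * 0.525 = 5.25.
D_EXAMPLE = diffusivity(lam_linear, 0.20, THETA0)
```

When closed forms for the derivative and the integral are available, as in the example of Section 4, they would normally be preferred to the numerical approximations used in this sketch.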

4. An Example: Manawatu Sandy Loam

To understand the extent of the linear association between $\lambda$ and $\theta$ that can be achieved by marginal transformations, we examine the results of applying the ACE and AVAS algorithms. Even for the best choices of $f$ and $g$, the linear association between $\lambda$ and $\theta$ is imperfect. Still, the ACE transformed variables plotted in Figure 2 and the AVAS transformed variables plotted in Figure 3 exhibit substantially greater linear association than the plot of the untransformed variables given in Figure 1. (We have used the implementation of the empirical ACE algorithm due to L. Breiman which is incorporated in The Statistics Store (J.M. Schilling (1985)), and the implementation of AVAS obtained from R. Tibshirani (see Tibshirani (1988)).) When we measure the linear association of the transformed variables in terms of $R^2$, we find respectable values of $R^2 = .93$ for ACE and $R^2 = .92$ for AVAS. These values provide us with a benchmark, and, in fact, one of the principal benefits of the ACE and AVAS algorithms is that they provide a standard against which more analytically appealing transformations can be judged.
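To make the benchmark idea concrete, here is a minimal Python sketch of an empirical ACE-style iteration in which the conditional expectations are replaced by a scatterplot smoother. It is not the implementation used for the results above: the smoother is a plain nearest-neighbour running mean rather than the supersmoother, the update uses the newest $f$ when refreshing $g$, and the data are synthetic. All function and variable names are assumptions made for illustration.

```python
import math


def running_mean(x, z, k=5):
    """Smooth z against x: average z over up to k neighbours on each side in x-order."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    out = [0.0] * len(x)
    for pos, i in enumerate(order):
        lo, hi = max(0, pos - k), min(len(x), pos + k + 1)
        out[i] = sum(z[order[j]] for j in range(lo, hi)) / (hi - lo)
    return out


def standardize(v):
    """Center v and scale it to unit variance."""
    m = sum(v) / len(v)
    s = math.sqrt(sum((t - m) ** 2 for t in v) / len(v))
    return [(t - m) / s for t in v]


def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    return sum(p * q for p, q in zip(standardize(u), standardize(v))) / len(u)


def ace(x, y, iterations=20, k=5):
    """Alternate smoothed conditional expectations, keeping var(g) = 1."""
    g = standardize(y)
    f = standardize(x)
    for _ in range(iterations):
        f = running_mean(x, g, k)               # f <- smoothed E(g(Y) | X)
        g = standardize(running_mean(y, f, k))  # g <- smoothed E(f(X) | Y), unit variance
    return f, g


# Synthetic, deterministic data (not the soil data): y is roughly exp(x).
xs = [i / 10.0 for i in range(60)]
ys = [math.exp(x) * (1.0 + 0.05 * math.sin(3 * i)) for i, x in enumerate(xs)]
f_hat, g_hat = ace(xs, ys)
```

Comparing `corr(f_hat, g_hat)` with `corr(xs, ys)` plays the role of the $R^2$ benchmark: the squared correlation of the transformed values is the figure against which simpler analytic transformations would be judged.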

Figure 2. Scatterplot of the ACE transformed $\theta_i$ (vertical axis, $\hat f(\theta)$) versus the ACE transformed $\lambda_i$ (horizontal axis, $\hat g(\lambda)$). This plot is used to assess linearity of the ACE transformation.

Figure 3. Scatterplot of the AVAS transformed $\theta_i$ (vertical axis, $\hat f(\theta)$) versus the AVAS transformed $\lambda_i$. This plot is used to assess linearity of the AVAS transformation.

To aid the search for such surrogates for $f$ and $g$, the ACE and AVAS transformed variables are plotted against the untransformed variables to see if simpler functional forms might suffice. Figures 4 and 5 show the plots of $(\lambda_i, \hat g(\lambda_i))$, $1 \le i \le n$, and $(\theta_i, \hat f(\theta_i))$, $1 \le i \le n$, respectively, for both ACE and AVAS.

Figure 4. Scatterplot of the ACE and AVAS transformed $\lambda_i$ versus $\lambda_i$ that is used to suggest a power transformation approximating $\hat g(\lambda)$. Legend: 1 = ACE, 2 = AVAS; standardized $e^{\lambda}$ is also shown.

Figure 5. Scatterplot of the ACE and AVAS transformed $\theta_i$ versus $\theta_i$ that is used to suggest a power transformation approximating $\hat f(\theta)$. Notice that the bulge rule may not be directly applicable here. Legend: 1 = ACE, 2 = AVAS (standardized).

The hunt for analytically tractable replacements for $f$ and $g$ is further guided by the so-called bulging rule of Mosteller and Tukey (1977). Loosely speaking, the bulging rule suggests finding an outward normal to a smoothed plot of the data and using the signs of the normal components to guide one's choice of transformation. For example, Figure 4 exhibits a bulge where both the $x$ and $y$ components of the outward normal are positive. The bulging rule then suggests that the variable plotted on the horizontal axis should be transformed by moving up the scale of powers. In fact, the successive examination of plots of $(\lambda_i^{\alpha}, \hat g(\lambda_i))$, $1 \le i \le n$, for larger values of $\alpha$ continues to suggest moving up the scale of powers, and we are thus led to consider the exponential transformation. For comparison, $e^{\lambda}$ (standardized to have mean 0 and variance 1) is also shown in Figure 4. The exponential appears to be a compromise between the transformations suggested by ACE and AVAS. An alternative approach to this exploratory search for an appropriate transformation would be to use the method of Box and Cox (1964).
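The bulging rule itself is a visual device, but the idea of stepping along the scale of powers can be mimicked numerically. The sketch below is one such stand-in, not the rule as stated by Mosteller and Tukey: it re-expresses a positive variable at each rung of an assumed ladder of powers (with the logarithm at power 0) and reports the rung whose re-expression is most nearly linear in a target transformation, as measured by absolute correlation. The ladder, the function names, and the demonstration data are all illustrative assumptions.

```python
import math


def reexpress(x, power):
    """Ladder-of-powers re-expression: x**power, with log(x) standing in for power 0."""
    return math.log(x) if power == 0 else x ** power


def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def best_power(x, target, ladder=(-1, -0.5, 0, 0.5, 1, 2, 3)):
    """Return the rung whose re-expression of x is most linear in target, plus all scores."""
    scores = {p: abs(pearson([reexpress(t, p) for t in x], target)) for p in ladder}
    return max(scores, key=scores.get), scores


# Made-up demonstration: the target is exactly the square of x, so the search
# should settle on the rung 2.
x_demo = [0.5 + 0.25 * i for i in range(20)]
target_demo = [t ** 2 for t in x_demo]
power_demo, scores_demo = best_power(x_demo, target_demo)
```

If the best score keeps improving as larger powers are added to the ladder, that is the numerical analogue of the plots continuing to suggest a move up the scale, which is the situation that leads to trying the exponential.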

When we begin a similar examination of the plot of $(\theta_i, \hat f(\theta_i))$, $1 \le i \le n$, given in Figure 5, the guidance of the bulging rule for re-expression diverges for ACE and AVAS. For the ACE transformation, there may be a modest indication that we might wish to send $\theta$ down the scale of powers, but the indication is not supported when tried. Fortunately, we have recourse to a second approach that does suggest an appropriate transformation, and we can consider the plot of $\theta_i$ versus $e^{\lambda_i}$ which is shown in Figure 6. After all, since we have settled on $e^{\lambda}$ as the surrogate for $\hat g(\lambda)$, the principal remaining task is to determine a surrogate $F$ for $\hat f$ such that the scatterplot $(F(\theta_i), e^{\lambda_i})$ is approximately linearized. The bulging rule applied to Figure 6 initially suggests that we consider a transformation $F$ that moves $\theta$ up the ladder of powers, and successive applications of the bulging rule eventually lead us to the choice of $F(\theta) = \theta^3$.

Figure 6. Scatterplot of $\theta_i$ versus $e^{\lambda_i}$. The bulge rule suggests going up the ladder for either $\theta$ or $\lambda$.

For the AVAS transformation in Figure 5, the bulging rule is directly applicable and suggests using $F(\theta) = \theta^3$ again. (The correlation between the AVAS $\hat f(\theta_i)$ and $\theta_i^3$ is .999.) Thus, for this data set, we are led to the same transformation from both algorithms. Notice that $G(\lambda) = e^{\lambda}$ and $F(\theta) = \theta^3$ preserve the homoscedasticity of the AVAS transformations and the linearity of both ACE and AVAS. Strikingly, using $F(\theta) = \theta^3$ and $G(\lambda) = e^{\lambda}$ achieves an $R^2$ of .93 that meets the level of the optimal $R^2 = .93$ achieved by the ACE transformations. Moreover, when we consider the plot of $\theta_i^3$ versus $e^{\lambda_i}$ (both standardized) given in Figure 7, the visual impact of the linear association exhibited by this figure seems to compare well with that exhibited by the ACE transformed variables of Figure 2 and the AVAS transformed variables of Figure 3. On the basis of the quantitative evidence provided by comparing $R^2$'s, the subjective evidence provided by comparison of the scatterplots of Figures 2, 3, and 7, and the fact that both the ACE and AVAS algorithms suggested the same transformations, it seems appropriate to settle on the transformation choices of Figure 7.
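The $R^2$ comparison used in this section can be sketched in a few lines of Python. The example below fits $\theta^3$ on $e^{\lambda}$ by ordinary least squares and compares the resulting $R^2$ with that of the untransformed regression. The data are synthetic and noise-free, generated from hypothetical constants `A_TRUE` and `B_TRUE` that are not the estimates for the Manawatu data; the helper `ols` is likewise an assumption made for illustration.

```python
import math


def ols(u, v):
    """Least-squares fit v = a + b*u; returns (a, b, r_squared)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suu = sum((p - mu) ** 2 for p in u)
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    b = suv / suu
    a = mv - b * mu
    ss_res = sum((q - (a + b * p)) ** 2 for p, q in zip(u, v))
    ss_tot = sum((q - mv) ** 2 for q in v)
    return a, b, 1.0 - ss_res / ss_tot


# Hypothetical constants and synthetic data obeying theta**3 = A_TRUE + B_TRUE * exp(lambda).
A_TRUE, B_TRUE = 0.045, -1.0e-4
lam_demo = [0.25 * i for i in range(1, 23)]  # lambda from 0.25 to 5.5
theta_demo = [(A_TRUE + B_TRUE * math.exp(l)) ** (1.0 / 3.0) for l in lam_demo]

# Regression in the transformed scale: F(theta) = theta**3 on G(lambda) = exp(lambda).
a_hat, b_hat, r2_transformed = ols([math.exp(l) for l in lam_demo],
                                   [t ** 3 for t in theta_demo])

# Regression in the raw scale, for comparison.
_, _, r2_raw = ols(lam_demo, theta_demo)
```

With real data the transformed $R^2$ would fall below 1, and it is this value that would be set beside the ACE and AVAS benchmarks.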


Figure 7. Scatterplot of $\theta_i^3$ versus $e^{\lambda_i}$ (standardized). Approximate linearity is achieved with this transformation, which should be compared with the ACE transformations of Figure 2.

For the Manawatu sandy loam data our exploratory analysis has led us to an approximate relationship of the form

$$ F(\theta) = a + bG(\lambda), \tag{4.1} $$

where $F(\theta) = \theta^3$ and $G(\lambda) = e^{\lambda}$. The coefficients in (4.1) can now be estimated by ordinary least squares, from which we obtain $a = 4.48 \times 10^{-2}$ and $b = 1.20 \times 10^{-[\text{illegible}]}$, with nominal standard errors of $5.30 \times 10^{-4}$ and $3.30 \times 10^{-6}$, respectively. In Figure 8, we show a plot of the predicted values versus the residuals that are obtained from fitting the model (4.1) by ordinary least squares. The residuals appear approximately homoscedastic, and we have no reason to be discontent with the estimates obtained by ordinary least squares. If the scatterplot of Figure 8 had exhibited greater heteroscedasticity, we would probably have elected to apply iteratively reweighted least squares, or a similarly directed technique.

Figure 8. Scatterplot of residuals versus predicted values for the model $\theta_i^3 = a + b e^{\lambda_i}$. Residuals appear to be approximately homoscedastic.

As a final check on the reasonability of the fitted model, one should consider the fit in terms of the untransformed variables as exhibited in Figure 8. This plot has no flagrant defects; indeed it suggests that the procedure has been a reasonable one.

By differentiating (4.1) we find the general relationship

$$ \frac{d\lambda}{d\theta}(\theta) = \frac{b^{-1} F'(\theta)}{G'\left(G^{-1}\left((F(\theta) - a)/b\right)\right)}, \tag{4.2} $$

and, for the choices that were made by means of the exploratory analysis of the Manawatu sandy loam data, one finds a particularly simple result:

$$ \frac{d\lambda}{d\theta} = \frac{3\theta^2}{\theta^3 - a}. \tag{4.3} $$

In order to obtain $D(\theta)$ it remains only to determine the integral of $\lambda(\theta) = G^{-1}((F(\theta) - a)/b)$. For the Manawatu sandy loam data we find $\lambda(\theta) = \log((\theta^3 - a)/b)$, and the integral of $\lambda(\theta)$ can be determined analytically. For more details, including a discussion of interval estimates of $D(\theta)$, the reader is referred to De Veaux and Steele (1989).

5. Discussion

Both the ACE and AVAS algorithms were used to suggest analytic forms for a transformation of a regressor and response which would exhibit linearity and homoscedasticity. To aid the search for such functional forms, the bulging rule of Mosteller and Tukey (1977) was used when appropriate. The ACE transformation, while displaying a high degree of linearity ($R^2 = .93$), also showed nonconstant variance in the response. The AVAS transformations did nearly as well in terms of linearity ($R^2 = .92$) and the transformed response had nearly constant variance. For our data set, both the ACE and AVAS algorithms led to the same functional forms $G(\lambda) = e^{\lambda}$ and $F(\theta) = \theta^3$ for the regressor and response respectively. Functionals of the curves were directly attainable from the linear model $F(\theta) = a + bG(\lambda)$ which through residual analysis seemed plausible. The success of the procedure in this case suggests that both the ACE and AVAS transformations should be considered as exploratory tools by the data analyst to suggest appropriate functional forms for transformation.

References

Box, G.E.P. and Cox, D.R. (1964). "An analysis of transformations," J. Royal Statist. Soc., B26, 211-243, discussion 244-252.

Breiman, L. and Friedman, J.H. (1985). "Estimating optimal transformations for multiple regression and correlation," J. Amer. Statist. Assoc., 80, 580-597.

Bruce, R.R. and Klute, A. (1956). "The measurement of soil moisture diffusivity," Soil Sci. Soc. Am. Proc., 20, 458-462.

Clothier, B.E. and Scotter, D.R. (1982). "Constant-flux infiltration from a hemispherical cavity," Soil Sci. Soc. Am. J., 46, 696-700.

De Veaux, R.D. and Steele, J.M. (1989). "ACE guided transformation method for estimation of the coefficient of soil water diffusivity," Technometrics, (in press).

DuChateau, P.C., Nofziger, D.L., Ahuja, L.R., and Schwartzendruber, D. (1972). "Experimental curves and rates of change from piecewise parabolic fits," Agron. J., 64, 538-542.

Friedman, J.H. and Stuetzle, W. (1982). "Smoothing of scatterplots," Technical Report Orion 3, Department of Statistics, Stanford University.

Jost, W. (1960). Diffusion in Solids, Liquids, and Gases, 3rd ed., Academic Press, New York.

Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression, Addison-Wesley, Reading, MA.

Schilling, J.M. (1985). The Statistics Store, AT&T Bell Laboratories, Murray Hill, NJ.

Tibshirani, R. (1988). "Estimating transformations for regression via additivity and variance stabilization," J. Amer. Statist. Assoc., 83, 394-405.
