QUALIFYING EXAM, STAT 541, 2008 In the very last class we very briefly had a look at the ACE algorithm. It is used to find non-linear transformations of both predictors and the response in a qualitative way, that is, not in terms of algebraic formulas but in terms of pictures of suggested transformations. The goal for this project is to write code that provides bootstrap confidence bands for the estimated transforms with 90% simultaneous coverage. You should therefore write functions that construct these simultaneous confidence bands, and you should do so for two types of bootstrap. This is the outline in the rough. Here are the details: 0) Go to the class notes (actually, the Recaps file) for instructions on how to install "acepack" from the R repository. You can use the Boston Housing data as in class, but feel free to also illustrate with other datasets. Your R code should be two functions described below, plus top level code that shows their application. 1) Study and execute the R code that illustrates the usage of the "ace()" function. Note that the transformations are returned as vectors that contain estimated function values at the observed values of the variables. These "functions" are not obtained by evaluating any analytic functions but by smoothing appropriate partial residuals. Note also a silliness: in the returned list the matrix of untransformed predictors (boston.ace$x) is strangely of size pxn, whereas the matrix of transformed predictors (boston.ace$tx) is of size nxp. 2) In class we saw qualitative inference with "spaghetti plots", that is, overplotting of curves. The goal of this project is to replace the spaghetti plots with formal inference whereby we construct upper and a lower boundary curves that together trace bands around the transformations. These upper and lower boundary curves should be calibrated to contain 90% of all bootstrap "curves". "Curve" is in quotes because we will solve the simultaneity problem only for a discretized version, not continuous functions. 3) First, write a function that collects the bootstrapped "transformations". Its primary input is a matrix "xy" of size (N)x(p+1), where the last column contains the response. a) The next argument for your function should be "nboot=99", which is the number of bootstrap iterations. Keep it small while debugging, and increase it later for production runs. In addition, implement bootstrap that uses sampling with and without replacement, and with arbitrary bootstrap sample sizes. You therefore need two more arguments: "bootsize=nrow(xy)" and "replace=T", both to be passed to "sample()". When using "replace=F", this is called subsampling as opposed to bootstrap sampling; in this case one must make sure that "bootsize" is less than "N". However, when using "replace=T", one can allow "bootsize" greater than "nrow(xy)". b) A problem arises because bootstrap samples do not contain all the observed points. A complication you have to deal with is to inter/extrapolate the bootstrapped "transformations" to a fixed grid for each variable, just as we did for cross-validation and smoothing. Give your function an argument "ngrid=100" to specify the number of grid points that span the ranges of the variables. The next order of business in your function is therefore allocating a matrix of size (ngrid)x(p+1) of grid points for the predictors and the response, whose columns are to be given to "approx()" as "x" argument for inter/extrapolating the "transformations" from bootstrapped "ace()" to the grids. Make sure you use "approx()" exactly as we did for cross-validation; do not worry if "approx()" complains about "collapsing to unique 'x' values...". c) The results should be collected in a 3-D array with dimensions (ngrid)x(p+1)x(nboot). Return a list containing the (ngrid)x(p+1) matrix of grid points and this 3-D array of bootstrap results. d) Run your function several times with the following choices: "ngrid=100" (default ok), "nboot" as large as possible given the time constraints, "replace=T" with "bootsize=N,2*N" as well as "replace=F" with "bootsize=N/2,N*2/3", for a total of 4 runs. 3) Once you have at least one 3-D array, you can start writing a function that constructs normal (mean+-t*SE) confidence bands with simultaneous coverage for all transformations at all grid points. Think of it as having (ngrid)x(p+1) statistics for which you want a simultaneous confidence box. a) Before embarking on the confidence bands, check whether bootstrap detects noticeable bias. To this end, overplot the ACE transformation with the average bootstrap curves at the grid points. Describe where you find discrepancies. (This might also be a good time to wonder about missing values in the bootstrap array. What might be their origin?) b) Make plots of the standard deviations as a function of the grid for each variable. These are the pointwise bootstrap SEs for each grid point of each variable. Do you think they should be smoothed or are they sufficiently smooth already to be trusted? What is the general pattern of these SD (actually: SE) curves? c) Write a function that accepts the bootstrap output and uses bisection to find the appropriate multiplier "t" such that the bootstrap average transformations from a) plus/minus "t" times the SEs from b) contain 90% of the bootstrap transformations simultaneously FOR ALL VARIABLES AT ALL GRID POINTS. That is, for each value of "t" you try, you must figure what fraction of (ngrid)x(p+1) bootstrap matrices are fully sandwiched by the two (ngrid)x(p+1) matrices obtained by the (ngrid)x(p+1) bootstrap average matrix +- "t" times the (ngrid)x(p+1) SE-matrix. d) Plot the bands: black: the original ACE transformations blue: the average bootstrap ACE transformations green: blue +- 1.645 * SE curves (pointwise 90% coverage) red: blue +- t * SE curves (simultaneous 90% coverage) ==> 14 frames, one per variable Same for each of the 4 runs. 4) Compare the results from the 4 runs (2 with, 2 without replacement). 5) Spend a little bit of time pondering whether the calibration method could have problems. For example, are we working with too many statistics? How many bootstrap iterations should one perform? Do you have ideas for solving the simultaneity problem differently? 6) IMPORTANT ADDITION: For different "bootsize" values you are estimating different standard errors. Propose adjustments that shrink or stretch the SE to match "bootsize=N" when using "bootsize=N/2, N*2/3" for "replace=F" and "bootsize=N*3/2" when "replace=T". You are under honor code to work individually. Email me whatever you have in ways of a solution by Friday 12 noon. Best of luck! Andreas