Estimation and Inference with Proxy Data and Its Genetic Applications
Sai Li, Tony Cai, and Hongzhe Li
Existing high-dimensional statistical methods are largely developed for analyzing individual-level data. In this work, we study estimation and inference for high-dimensional linear models when only "proxy data" are available. These proxies include marginal statistics and sample covariance matrices computed from distinct sets of individuals. We develop a rate-optimal method for estimation and inference for the regression coefficient vector and its linear functionals based on the proxy data. We show intrinsic limitations of proxy-data-based inference: the minimax optimal rate for estimation is slower than that in the conventional case where individual data are observed. These findings are illustrated through simulation studies and an analysis of a dataset concerning the genetic associations of hindlimb muscle weights in a mouse population.
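To make the proxy-data setting concrete, the following is a minimal sketch and not the estimator developed in this work. It assumes the marginal statistics take the form S = X'y/n from one sample and that the covariance proxy is the sample covariance of an independent reference sample. It then fits a generic Lasso-type quadratic-form objective, 0.5 b'Σb - b'S + λ‖b‖₁, by coordinate descent. The function names (`simulate_proxy_data`, `proxy_lasso`), the simulation design, and the value of λ are all hypothetical choices made for illustration.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def simulate_proxy_data(seed=0, n=2000, n_ref=2000, p=20):
    """Simulate proxy data (illustrative design, not taken from the paper).

    Returns the true coefficients, the marginal statistics S = X'y / n
    from the study sample, and a sample covariance matrix computed from
    a distinct reference sample.
    """
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[:3] = [1.0, -0.5, 0.8]
    X = rng.standard_normal((n, p))          # study sample (never returned)
    y = X @ beta + 0.5 * rng.standard_normal(n)
    X_ref = rng.standard_normal((n_ref, p))  # independent reference sample
    S = X.T @ y / n                          # marginal statistics
    Sigma_ref = X_ref.T @ X_ref / n_ref      # proxy covariance
    return beta, S, Sigma_ref


def proxy_lasso(S, Sigma_ref, lam, n_iter=200):
    """Coordinate descent for 0.5 b'Sigma b - b'S + lam * ||b||_1.

    Assumes Sigma_ref has a strictly positive diagonal.
    """
    p = S.shape[0]
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Gradient-like term with coordinate j's own contribution removed.
            r_j = S[j] - Sigma_ref[j] @ b + Sigma_ref[j, j] * b[j]
            b[j] = soft_threshold(r_j, lam) / Sigma_ref[j, j]
    return b


if __name__ == "__main__":
    beta, S, Sigma_ref = simulate_proxy_data()
    b_hat = proxy_lasso(S, Sigma_ref, lam=0.05)
    print("max abs error:", np.max(np.abs(b_hat - beta)))
```

Only `S` and `Sigma_ref` are passed to the fitting routine, so the individual-level study data are never used. Because the covariance proxy comes from different individuals, its sampling error enters the estimate in addition to the noise in `S`, which gives some intuition for why estimation from proxy data can be harder than in the individual-data case.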