Supervised Topic Modeling: Optimal Estimation and Statistical Inference
Ruijia Wu, Linjun Zhang, and Tony Cai
Abstract:
The rapid growth of digital textual data from a variety of fields has made it increasingly important to develop statistical methods for the analysis of such data with rigorous theoretical guarantees. In this paper, we focus on supervised topic modeling within the framework of generalized linear models (GLMs) and probabilistic latent semantic indexing (pLSI) models. One of the major challenges of the analysis is that the covariates are unobservable. We propose a novel bias-adjusted estimator of the covariates and use it to estimate the regression vector. We establish minimax optimal rates of convergence and show that the proposed estimator is rate-optimal up to a logarithmic factor. In addition, we consider statistical inference for individual regression coefficients and construct confidence intervals based on an asymptotically unbiased and normally distributed estimator. The effectiveness of our proposed algorithms is demonstrated through simulation studies and applications to the analysis of a movie review dataset and a gut microbiome dataset.