## Scribbles

Lecture notes on Theory of Statistical Learning and Sequential Prediction (STAT928) are shaping up; they are still a work in progress.

Slides for the Machine Learning Summer School in Kyoto, 2012

Lecture notes on online convex optimization, written mostly in 2008 at UC Berkeley (latest revision April 2009).

## Preprints

Online Nonparametric Regression (with K. Sridharan). Submitted.
Abstract: We establish optimal rates for online regression for arbitrary classes of regression functions in terms of the sequential entropy introduced in (Rakhlin, Sridharan, Tewari, 2010). The optimal rates are shown to exhibit a phase transition analogous to the i.i.d./statistical learning case, studied in (Rakhlin, Sridharan, Tsybakov 2013). In the frequently encountered situation when sequential entropy and i.i.d. empirical entropy match, our results point to the interesting phenomenon that the rates for statistical learning with squared loss and online nonparametric regression are the same.
In addition to a non-algorithmic study of minimax regret, we exhibit a generic forecaster that enjoys the established optimal rates. We also provide a recipe for designing online regression algorithms that can be computationally efficient. We illustrate the techniques by deriving existing and new forecasters for the case of finite experts and for online linear regression.

Entropy, Minimax Regret and Minimax Risk (with K. Sridharan and A. Tsybakov). Submitted.
Abstract: We consider the random design regression model with square loss. We propose a method that aggregates empirical minimizers (ERM) over appropriately chosen random subsets and reduces to ERM in the extreme case, and we establish sharp oracle inequalities for its risk. We show that, under the $\epsilon^{-p}$ growth of the empirical $\epsilon$-entropy, the excess risk of the proposed method attains the rate $n^{-\frac{2}{2+p}}$ for $p\in(0,2]$ and $n^{-1/p}$ for $p> 2$ where $n$ is the sample size. Furthermore, for $p\in(0,2]$, the excess risk rate matches the behavior of the minimax risk of function estimation in regression problems under the well-specified model. This yields a conclusion that the rates of statistical estimation in well-specified models (minimax risk) and in misspecified models (minimax regret) are equivalent in the regime $p\in(0,2]$. In other words, for $p\in(0,2]$ the problem of statistical learning enjoys the same minimax rate as the problem of statistical estimation. On the contrary, for $p>2$ we show that the rates of the minimax regret are, in general, slower than for the minimax risk. Our oracle inequalities also imply the $v\log(n/v)/n$ rates for Vapnik-Chervonenkis type classes of dimension $v$ without the usual convexity assumption on the class; we show that these rates are optimal. Finally, for a slightly modified method, we derive a bound on the excess risk of $s$-sparse convex aggregation improving that of (Lounici, 07) and providing the optimal rate.

Efficient Sampling from Time-Varying Distributions (with H. Narayanan). Submitted.
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes and closely tracks a time-varying log-concave distribution. We develop general theoretical guarantees on the required number of steps; this number can be calculated on the fly according to the distance from and the shape of the next distribution. We then illustrate the technique on several examples. Within the context of exponential families, the proposed method produces samples from a posterior distribution which is updated as data arrive in a streaming fashion. The sampling technique can be used to track time-varying truncated distributions, as well as to obtain samples from a changing mixture model, fitted in a streaming fashion to data. In the setting of linear optimization, the proposed method has oracle complexity with best known dependence on the dimension for certain geometries. In the context of online learning and repeated games, the algorithm is an efficient method for implementing no-regret mixture forecasting strategies. Remarkably, in some of these examples, only one step of the random walk is needed to track the next distribution.

On Zeroth-Order Stochastic Convex Optimization via Random Walks (with T. Liang and H. Narayanan). Submitted.
Abstract: We propose a method for zeroth order stochastic convex optimization that attains the suboptimality rate of $n^{7}T^{-1/2}$ after $T$ queries for a convex bounded function $f:\mathbb{R}^n \to \mathbb{R}$. The method is based on a random walk (the Ball Walk) on the epigraph of the function. The randomized approach circumvents the problem of gradient estimation, and appears to be less sensitive to noisy function evaluations compared to noiseless zeroth order methods.

## Publications

Sequential Complexities and Uniform Martingale Laws of Large Numbers (with K. Sridharan and A. Tewari). To appear in Probability Theory and Related Fields.
Abstract: We establish necessary and sufficient conditions for a uniform martingale Law of Large Numbers. We extend the technique of symmetrization to the case of dependent random variables and provide "sequential" (non-i.i.d.) analogues of various classical measures of complexity, such as covering numbers and combinatorial dimensions from empirical process theory. We establish relationships between these various sequential complexity measures and show that they provide a tight control on the uniform convergence rates for empirical processes with dependent data. As a direct application of our results, we provide exponential inequalities for sums of martingale differences in Banach spaces.
We demonstrate the utility of these results in two domains. First, we consider the problem of sequential prediction. Analogous to the role of classical empirical process theory in statistical learning (with i.i.d. data), the developed theory is shown to yield precise learning guarantees for the problem of sequential prediction. In particular, the minimax learning rate is shown to be tightly controlled by the universal uniform convergence rates for empirical processes. As a second (direct) application of our results, we provide exponential inequalities for sums of martingale difference sequences in Banach spaces.

Partial monitoring -- classification, regret bounds, and algorithms (with G. Bartók, D. Foster, D. Pál, and C. Szepesvári). To appear in Mathematics of Operations Research.
Abstract: In a partial monitoring game, the learner repeatedly chooses an action, the environment responds with an outcome, and then the learner suffers a loss and receives a feedback signal, both of which are fixed functions of the action and the outcome. The goal of the learner is to minimize his regret, which is the difference between his total cumulative loss and the total loss of the best fixed action in hindsight. In this paper we characterize the minimax regret of any partial monitoring game with finitely many actions and outcomes. It turns out that the minimax regret of any such game is either zero, Θ(√T), Θ(T^{2/3}), or Θ(T). We provide computationally efficient learning algorithms that achieve the minimax regret within a logarithmic factor for any game. In addition to the bounds on the minimax regret, if we assume that the outcomes are generated in an i.i.d. fashion, we prove individual upper bounds on the expected regret.
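
For readers who prefer a formula, the regret described in this abstract can be written as follows. The symbols are illustrative assumptions rather than the paper's notation: $I_t$ is the learner's action in round $t$, $J_t$ is the outcome, $L$ is the fixed loss function, and $T$ is the number of rounds.

$$
R_T \;=\; \sum_{t=1}^{T} L(I_t, J_t) \;-\; \min_{i} \sum_{t=1}^{T} L(i, J_t)
$$

The first sum is the learner's total cumulative loss; the second is the total loss of the best fixed action in hindsight.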

On Semi-Probabilistic Universal Prediction (with K. Sridharan). Proceedings of IEEE Information Theory Workshop, 2013. Invited paper.
Abstract: We discuss two scenarios of universal prediction, as well as some recent advances in the study of minimax regret and algorithmic development. We then propose an intermediate scenario, the Semi-Probabilistic Setting, and make progress towards understanding the associated minimax regret.

Optimization, Learning, and Games with Predictable Sequences (with K. Sridharan). NIPS 2013.
Abstract: We provide several applications of Optimistic Mirror Descent, an online learning algorithm based on the idea of predictable sequences. First, we recover the Mirror Prox algorithm for offline optimization, prove an extension to Hölder-smooth functions, and apply the results to saddle-point type problems. Next, we prove that a version of Optimistic Mirror Descent (which has a close relation to the Exponential Weights algorithm) can be used by two strongly-uncoupled players in a finite zero-sum matrix game to converge to the minimax equilibrium at the rate of O((log T)/T). This addresses a question of Daskalakis et al 2011. Further, we consider a partial information version of the problem. We then apply the results to convex programming and exhibit a simple algorithm for the approximate Max Flow problem.

Online Learning of Dynamic Parameters in Social Networks (with S. Shahrampour and A. Jadbabaie). NIPS 2013.
Abstract: This paper addresses the problem of online learning in a dynamic setting. We consider a social network in which each individual observes a private signal about the underlying state of the world and communicates with her neighbors at each time period. Unlike many existing approaches, the underlying state is dynamic, and evolves according to a geometric random walk. We view the scenario as an optimization problem where agents aim to learn the true state while suffering the smallest possible loss. Based on the decomposition of the global loss function, we introduce two update mechanisms, each of which generates an estimate of the true state. We establish a tight bound on the rate of change of the underlying state, under which individuals can track the parameter with a bounded variance. Then, we characterize explicit expressions for the steady state mean-square deviation (MSD) of the estimates from the truth, per individual. We observe that only one of the estimators recovers the optimal MSD, which underscores the impact of the objective function decomposition on the learning quality. Finally, we provide an upper bound on the regret of the proposed methods, measured as an average of errors in estimating the parameter in a finite time.

Competing with Strategies (with W. Han and K. Sridharan). COLT 2013.
Abstract: We study the problem of online learning with a notion of regret defined with respect to a set of strategies. We develop tools for analyzing the minimax rates and for deriving regret-minimization algorithms in this scenario. While the standard methods for minimizing the usual notion of regret fail, through our analysis we demonstrate existence of regret-minimization methods that compete with such sets of strategies as: autoregressive algorithms, strategies based on statistical models, regularized least squares, and follow the regularized leader strategies. In several cases we also derive efficient learning algorithms.

Online Learning with Predictable Sequences (with K. Sridharan). COLT 2013.
Abstract: We present methods for online linear optimization that take advantage of benign (as opposed to worst-case) sequences. Specifically, if the sequence encountered by the learner is described well by a known "predictable process", the algorithms presented enjoy tighter bounds as compared to the typical worst-case bounds. Additionally, the methods achieve the usual worst-case regret bounds if the sequence is not benign. Our approach can be seen as a way of adding prior knowledge about the sequence within the paradigm of online learning. The setting is shown to encompass partial and side information. Variance and path-length bounds (Hazan and Kale, 2010, Chiang et al 2012) can be seen as particular examples of online learning with simple predictable sequences.
We further extend our methods and results to include competing with a set of possible predictable processes (models), that is, "learning" the predictable process itself concurrently with using it to obtain better regret guarantees. We show that such model selection is possible under various assumptions on the available feedback. Our results suggest a promising direction of further research with potential applications to stock market and time series prediction.

Localization and Adaptation in Online Learning (with O. Shamir and K. Sridharan). AISTATS 2013.
Abstract: We introduce a formalism of localization for online learning problems. Similarly to statistical learning theory, localization can be used to obtain fast rates. We introduce local sequential Rademacher complexities and other local measures. Based on the idea of relaxations for deriving algorithms, we provide a template method that takes advantage of localization. Furthermore, we build a general adaptive method that can take advantage of the suboptimality of the observed sequence. We illustrate the utility of the introduced concepts on several problems. Among them is an upper bound on regret in terms of classical Rademacher complexity when the data are i.i.d.

Stochastic Convex Optimization with Bandit Feedback (with A. Agarwal, D. Foster, D. Hsu, and S. Kakade). SIAM Journal on Optimization, 23-1 (2013), pp. 188--212.
Abstract: This paper addresses the problem of minimizing a convex, Lipschitz function f over a convex, compact set X under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value f(x) at any query point x in X. The quantity of interest is the regret of the algorithm, which is the sum of the function values at the algorithm's query points minus the optimal function value. We demonstrate a generalization of the ellipsoid algorithm that incurs Õ(poly(d)√T) regret. Since any algorithm has regret at least Ω(√T) on this problem, our algorithm is optimal in terms of the scaling with T.

Relax and Randomize: From Value to Algorithms (with O. Shamir and K. Sridharan). NIPS 2012. Full arXiv version.
Abstract: We show a principled way of deriving online learning algorithms from a minimax analysis. Various upper bounds on the minimax value, previously thought to be non-constructive, are shown to yield algorithms. This allows us to seamlessly recover known methods and to derive new ones. Our framework also captures such "unorthodox" methods as Follow the Perturbed Leader and the R^2 forecaster. We emphasize that understanding the inherent complexity of the learning problem leads to the development of algorithms.
We define local sequential Rademacher complexities and associated algorithms that allow us to obtain faster rates in online learning, similarly to statistical learning theory. Based on these localized complexities we build a general adaptive method that can take advantage of the suboptimality of the observed sequence.
We present a number of new algorithms, including a family of randomized methods that use the idea of a "random playout". Several new versions of the Follow-the-Perturbed-Leader algorithms are presented, as well as methods based on Littlestone's dimension, efficient methods for matrix completion with trace norm, and algorithms for the problems of transductive learning and prediction with static experts.

Making Stochastic Gradient Descent Optimal for Strongly Convex Problems (with O. Shamir and K. Sridharan). ICML 2012.
Abstract: Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be O(log(T)/T), by running SGD for T iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal O(1/T) rate. This might lead one to believe that standard SGD is suboptimal, and maybe should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal O(1/T) rate. However, for non-smooth problems, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the O(1/T) rate, and no other change of the algorithm is necessary. We also present experimental results which support our findings, and point out open problems.
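
The abstract does not spell out the modified averaging step, so the sketch below is only one plausible reading: instead of averaging all T iterates, average the last fraction `alpha` of them (suffix averaging). The function names, the 1/(λt) step size, and the noisy quadratic test problem are assumptions made for illustration, not taken from the paper. The quadratic is smooth, so it does not reproduce the non-smooth gap the abstract describes; it only shows the mechanics.

```python
import numpy as np


def sgd_iterates(grad_oracle, w0, lam, T, rng):
    """Run SGD with step size 1/(lam * t); return the T iterates w_1..w_T."""
    w = np.asarray(w0, dtype=float).copy()
    iterates = np.empty((T, w.size))
    for t in range(1, T + 1):
        iterates[t - 1] = w            # record w_t before the update
        g = grad_oracle(w, rng)        # stochastic (sub)gradient at w_t
        w = w - g / (lam * t)          # assumed step size for strong convexity
    return iterates


def full_average(iterates):
    """Standard averaging: mean of all iterates."""
    return iterates.mean(axis=0)


def suffix_average(iterates, alpha=0.5):
    """Assumed modification: mean of the last ceil(alpha * T) iterates."""
    T = len(iterates)
    k = max(1, int(np.ceil(alpha * T)))
    return iterates[T - k:].mean(axis=0)


def noisy_quadratic_grad(w, rng, lam=1.0, noise=0.1):
    """Hypothetical oracle: noisy gradient of (lam / 2) * ||w||^2."""
    return lam * w + noise * rng.standard_normal(w.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    its = sgd_iterates(noisy_quadratic_grad, np.ones(3), lam=1.0, T=1000, rng=rng)
    print("full average  :", full_average(its))
    print("suffix average:", suffix_average(its, alpha=0.5))
```

Both averages are computed from the same run, so the only difference between them is which iterates enter the mean; the SGD update itself is unchanged.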

Interior-Point Methods for Full-Information and Bandit Online Learning (with J. Abernethy and E. Hazan). IEEE Transactions on Information Theory, vol 58, issue 7, pp. 4164--4175, 2012.
Abstract: We study the problem of predicting individual sequences with linear loss with full and partial (or bandit) feedback. Our main contribution is the first efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal Õ(√(T)) regret. In addition, for the full-information setting, we give a novel regret minimization algorithm. These results are made possible by the introduction of interior-point methods for convex optimization to online learning.

No Internal Regret via Neighborhood Watch (with D. Foster). AISTATS 2012.
Abstract: We present an algorithm which attains O(√T) internal (and thus external) regret for finite games with partial monitoring under the local observability condition. Recently, this condition has been shown by (Bartók, Pál, and Szepesvári 2011) to imply the O(√T) rate for partial monitoring games against an i.i.d. opponent, and the authors conjectured that the same holds for non-stochastic adversaries. Our result is in the affirmative, and it completes the characterization of possible rates for finite partial-monitoring games, an open question stated by (Cesa-Bianchi, Lugosi, and Stoltz 2006). Our regret guarantees also hold for the more general model of partial monitoring with random signals.

Lower Bounds for Passive and Active Learning (with M. Raginsky). NIPS 2011.
Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander's capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of "disagreement coefficient." For passive learning, our lower bounds match the upper bounds of Giné and Koltchinskii up to constants and generalize analogous results of Massart and Nédélec. For active learning, we provide the first known lower bounds based on the capacity function rather than the disagreement coefficient.

Complexity-Based Approach to Calibration with Checking Rules (with D. Foster, K. Sridharan and A. Tewari). COLT 2011.
Abstract: We consider the problem of forecasting a sequence of outcomes from an unknown source. The quality of the forecaster is measured by a family of checking rules. We prove upper bounds on the value of the associated game, thus certifying the existence of a calibrated strategy for the forecaster. We show that complexity of the family of checking rules can be captured by the notion of a sequential cover introduced in (Rakhlin, Sridharan, Tewari, 2010). Various natural assumptions on the class of checking rules are considered, including finiteness of Vapnik-Chervonenkis and Littlestone's dimensions.

Online Learning: Stochastic and Constrained Adversaries (with K. Sridharan and A. Tewari). NIPS 2011.
Abstract: Learning theory has largely focused on two main learning scenarios. The first is the classical statistical setting where instances are drawn i.i.d. from a fixed distribution and the second scenario is the online learning, completely adversarial scenario where adversary at every time step picks the worst instance to provide the learner with. It can be argued that in the real world neither of these assumptions are reasonable. It is therefore important to study problems with a range of assumptions on data. Unfortunately, theoretical results in this area are scarce, possibly due to absence of general tools for analysis. Focusing on the regret formulation, we define the minimax value of a game where the adversary is restricted in his moves. The framework captures stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We then consider the i.i.d. adversary and show equivalence of online and batch learnability. In the supervised setting, we consider various hybrid assumptions on the way that x and y variables are chosen. Finally, we consider smoothed learning problems and show that half-spaces are online learnable in the smoothed model. In fact, exponentially small noise added to adversary's decisions turns this problem with infinite Littlestone's dimension into a learnable problem.

Information-Based Complexity, Feedback, and Dynamics in Convex Programming (with M. Raginsky). IEEE Transactions on Information Theory, 2011.
Abstract: We study the intrinsic limitations of sequential convex optimization through the lens of feedback information theory. In the oracle model of optimization, an algorithm queries an oracle for noisy information about the unknown objective function, and the goal is to (approximately) minimize every function in a given class using as few queries as possible. We show that, in order for a function to be optimized, the algorithm must be able to accumulate enough information about the objective. This, in turn, puts limits on the speed of optimization under specific assumptions on the oracle and the type of feedback. Our techniques are akin to the ones used in statistical literature to obtain minimax lower bounds on the risks of estimation procedures; the notable difference is that, unlike in the case of i.i.d. data, a sequential optimization algorithm can gather observations in a controlled manner, so that the amount of information at each step is allowed to change in time. In particular, we show that optimization algorithms often obey the law of diminishing returns: the signal-to-noise ratio drops as the optimization algorithm approaches the optimum. To underscore the generality of the tools, we use our approach to derive fundamental lower bounds for a certain active learning problem. Overall, the present work connects the intuitive notions of "information" in optimization, experimental design, estimation, and active learning to the quantitative notion of Shannon information.

Online Learning: Beyond Regret (with K. Sridharan and A. Tewari). COLT 2011.
Abstract: We study online learnability of a wide class of problems, extending the results of (Rakhlin, Sridharan, Tewari, 2010) to general notions of performance measure well beyond external regret. Our framework simultaneously captures such well-known notions as internal and general Phi-regret, learning with non-additive global cost functions, Blackwell's approachability, calibration of forecasters, adaptive regret, and more. We show that learnability in all these situations is due to control of the same three quantities: a martingale convergence term, a term describing the ability to perform well if future is known, and a generalization of sequential Rademacher complexity, studied in (Rakhlin, Sridharan, Tewari, 2010). Since we directly study complexity of the problem instead of focusing on efficient algorithms, we are able to improve and extend many known results which have been previously derived via an algorithmic construction.

Online Learning: Random Averages, Combinatorial Parameters, and Learnability (with K. Sridharan and A. Tewari). NIPS 2010.
Abstract: We study learnability in the online learning model. We define several complexity measures which capture the difficulty of learning in a sequential manner. Among these measures are analogues of Rademacher complexity, covering numbers and fat shattering dimension from statistical learning theory. Relationship among these complexity measures, their connection to online learning, and tools for bounding them are provided. In the setting of supervised learning, finiteness of the introduced scale-sensitive parameters is shown to be equivalent to learnability. The complexities we define also ensure uniform convergence for non-i.i.d. data, extending the uniform Glivenko-Cantelli type results. We conclude by showing online learnability for an array of examples.

Random Walk Approach to Regret Minimization (with H. Narayanan). NIPS 2010.
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies.

Online Convex Programming in Adaptive Control (with M. Raginsky and S. Yüksel), IEEE Conference on Decision and Control, 2010.
Abstract: Online Convex Programming (OCP) is a recently developed model of sequential decision-making in the presence of time-varying uncertainty. In this framework, a decision-maker selects points in a convex feasible set to respond to a dynamically changing sequence of convex cost functions. A generic algorithm for OCP, often with provably optimal performance guarantees, is inspired by the Method of Mirror Descent (MD) developed by Nemirovski and Yudin in the 1970's. This paper highlights OCP as a common theme in adaptive control, both in its classical variant based on parameter tuning and in a more modern supervisory approach. Specifically, we show that: (1) MD leads to a generalization of classical adaptive control schemes based on recursive parameter tuning; (2) A supervisory controller switching policy that uses OCP to estimate system parameters from a sequence of appropriately regularized output prediction errors can flexibly adapt to presence or absence of output disturbances in the system.

Quantitative Analysis of Systems Using Game-Theoretic Learning (with S. Seshia). ACM Transactions on Embedded Computing Systems, 2010.
Abstract: The analysis of quantitative properties, such as timing and power, is central to the design of reliable embedded software and systems. However, the verification of such properties on a program is made difficult by their heavy dependence on the program's environment, such as the processor it runs on. Modeling the environment by hand can be tedious, error-prone and time consuming. In this paper, we present a new, game-theoretic approach to analyzing quantitative properties that is based on performing systematic measurements to automatically learn a model of the environment. We model the problem as a game between our algorithm (player) and the environment of the program (adversary), where the player seeks to accurately predict the property of interest while the adversary sets environment states and parameters. To solve this problem, we employ a randomized strategy that repeatedly tests the program along a linear-sized set of program paths called basis paths, using the resulting measurements to infer a weighted-graph model of the environment, from which quantitative properties can be predicted. We prove that our algorithm can, under certain assumptions and with arbitrarily high probability, accurately predict properties such as worst-case execution time or estimate the distribution of execution times.

Information Complexity of Black-Box Convex Optimization: A New Look via Feedback Information Theory (with M. Raginsky). Allerton Conference on Communication, Control, and Computing, 2009.
Abstract: This paper revisits information complexity of black-box convex optimization, first studied in the seminal work of Nemirovski and Yudin, from the perspective of feedback information theory. These days, large-scale convex programming arises in a variety of applications, and it is important to refine our understanding of its fundamental limitations. The goal of black-box convex optimization is to minimize an unknown convex objective function from a given class over a compact, convex domain using an iterative scheme that generates approximate solutions by querying an oracle for local information about the function being optimized. The information complexity of a given problem class is defined as the smallest number of queries needed to minimize every function in the class to some desired accuracy. We present a simple information-theoretic approach that not only recovers many of the results of Nemirovski and Yudin, but also gives some new bounds pertaining to optimal rates at which iterative convex optimization schemes approach the solution. As a bonus, we give a particularly simple derivation of the minimax lower bound for a certain active learning problem on the unit interval.

A Stochastic View of Optimal Regret through Minimax Duality (with J. Abernethy, A. Agarwal, and P. Bartlett). COLT 2009.
Abstract: We study the regret of optimal strategies for online convex optimization games. Using von Neumann's minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the maximum, over joint distributions of the adversary's action sequence, of the difference between a sum of minimal expected losses and the minimal empirical loss. We show that the optimal regret has a natural geometric interpretation, since it can be viewed as the gap in Jensen's inequality for a concave functional--the minimizer over the player's actions of expected loss--defined on a set of probability distributions. We use this expression to obtain upper and lower bounds on the regret of an optimal strategy for a variety of online learning problems. Our method provides upper bounds without the need to construct a learning algorithm; the lower bounds provide explicit optimal strategies for the adversary.

Beating the Adaptive Bandit with High Probability (with J. Abernethy). Information Theory and Applications Workshop, 2009; COLT 2009.
The tech report version with all the proofs is here.
Abstract: We provide a principled way of proving $O(\sqrt{T})$ high-probability guarantees for partial-information (bandit) problems over arbitrary convex decision sets. First, we prove a regret guarantee for the full-information problem in terms of "local" norms, both for entropy and self-concordant barrier regularization, unifying these methods. Given one of such algorithms as a black-box, we can convert a bandit problem into a full-information problem using a sampling scheme. The main result states that a high-probability $O(\sqrt{T})$ bound holds whenever the black-box, the sampling scheme, and the estimates of missing information satisfy a number of conditions, which are relatively easy to check. At the heart of the method is a construction of linear upper bounds on confidence intervals. As applications of the main result, we provide the first known efficient algorithm for the sphere with an $O(\sqrt{T})$ high-probability bound. We also derive the result for the n-simplex, improving the $O(\sqrt{nT\log(nT)})$ bound of Auer et al by replacing the $\log T$ term with $\log\log T$ and closing the gap to the lower bound of $\Omega(\sqrt{nT})$. While $O(\sqrt{T})$ high-probability bounds should hold for general decision sets through our main result, construction of linear upper bounds depends on the particular geometry of the set; we believe that the sphere example already exhibits the necessary ingredients. The guarantees we obtain hold for adaptive adversaries (unlike the in-expectation results of Abernethy et al) and the algorithms are efficient, given that the linear upper bounds on confidence can be computed.

Matrix Regularization Techniques for Online Multitask Learning (with A. Agarwal and P. Bartlett). Technical Report, 2008.
Abstract: In this paper we examine the problem of prediction with expert advice in a setup where the learner is presented with a sequence of examples coming from different tasks. In order for the learner to be able to benefit from performing multiple tasks simultaneously, we make assumptions of task relatedness by constraining the comparator to use a lesser number of best experts than the number of tasks. We show how this corresponds naturally to learning under spectral or structural matrix constraints, and propose regularization techniques to enforce the constraints. The regularization techniques proposed here are interesting in their own right and multitask learning is just one application for the ideas. A theoretical analysis of one such regularizer is performed, and a regret bound that shows benefits of this setup is reported.

Game-Theoretic Timing Analysis (with S. Seshia). IEEE/ACM Conference on Computer-Aided Design (ICCAD), 2008.
Abstract: Estimating the worst-case execution time (WCET) of tasks is a key step in the design of reliable real-time software and systems. In this paper, we present a new, game-theoretic approach to estimating WCET based on performing directed measurements on the target platform. We model the estimation problem as a game between our algorithm (player) and the environment of the program (adversary), where the player seeks to find the longest path through the program while the adversary sets environment parameters to thwart the player. We present both theoretical and experimental results demonstrating the utility of our approach. On the theoretical side, we prove that our algorithm can converge to find the longest path with high probability. Experimental results indicate that our approach is competitive with an existing technique based on static analysis and integer programming. Moreover, the approach can be easily applied to even complex hardware/software platforms.

Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization (with J. Abernethy and E. Hazan). COLT 2008.
Abstract: We introduce an efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal $O^*(\sqrt{T})$ regret. The setting is a natural generalization of the non-stochastic multi-armed bandit problem, and the existence of an efficient optimal algorithm has been posed as an open problem in a number of recent papers. We show how the difficulties encountered by previous approaches are overcome by the use of a self-concordant potential function. Our approach presents a novel connection between online learning and interior point methods.

High probability regret bounds for online optimization (with P. Bartlett, V. Dani, T. Hayes, S. Kakade, and A. Tewari). COLT 2008.
Abstract: We present a modification of the algorithm of Dani et al. for the online linear optimization problem in the bandit setting, which with high probability has regret at most $O^*(\sqrt{T})$ against an adaptive adversary. This improves on the previous algorithm of Dani et al., whose regret is bounded in expectation against an oblivious adversary. We obtain the same dependence on the dimension ($n^{3/2}$) as that exhibited by Dani et al. The results of this paper rest firmly on those of Dani et al. and the remarkable technique of Auer et al. for obtaining high-probability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information vs bandit settings.

Optimal Strategies and Minimax Lower Bounds for Online Convex Games (with Jacob Abernethy, Peter Bartlett, and Ambuj Tewari). COLT 2008.
Abstract: A number of learning problems can be cast as an Online Convex Game: on each round, a learner makes a prediction $x$ from a convex set, the environment plays a loss function $f$, and the learner's long-term goal is to minimize regret. Algorithms have been proposed by Zinkevich, when $f$ is assumed to be convex, and Hazan et al., when $f$ is assumed to be strongly convex, that have provably low regret. We consider these two settings and analyze such games from a minimax perspective, proving minimax strategies and lower bounds in each case. These results prove that the existing algorithms are essentially optimal.

Closing the Gap between Bandit and Full-Information Online Optimization: High-Probability Regret Bound (with Peter Bartlett and Ambuj Tewari), 2007
Abstract: We demonstrate a modification of the algorithm of Dani et al. for the online linear optimization problem in the bandit setting, which allows us to achieve an $O(\sqrt{T\ln T})$ regret bound with high probability, as opposed to the in-expectation result of Dani et al. Moreover, we obtain the same dependence on the dimension ($n^{3/2}$) as that exhibited by Dani et al. The results of this paper rest firmly on those of Dani et al. and the remarkable technique of Auer et al. for obtaining high-probability bounds via optimistic estimates. This paper answers an open question: it eliminates the gap between the high-probability bounds obtained in the full-information vs bandit settings.
Note: A similar analysis has been carried out independently by Dani, Hayes & Kakade. A merged version of our results is being submitted.

Adaptive Online Gradient Descent (with Peter Bartlett and Elad Hazan), NIPS 2007. Technical report version available here.
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates between $\sqrt{T}$ and $\log T$. Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms.
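As a rough illustration of the two regimes that the abstract above interpolates between, here is a minimal sketch of projected online gradient descent on an interval with the two classical step-size schedules. It is not the Adaptive Online Gradient Descent algorithm from the paper. The function names, the one-dimensional domain $[-1,1]$, the quadratic losses $f_t(x)=(x-z_t)^2$, and the constants $D$, $G$, $H$ are all assumptions chosen for the example.

```python
import math

def project(x, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return max(lo, min(hi, x))

def online_gradient_descent(grads, step_size, x0=0.0, lo=-1.0, hi=1.0):
    """Play x_t, then observe the gradient of f_t at x_t and take a projected step.

    grads: list of callables, grads[t-1](x) is the gradient of f_t at x.
    step_size: callable t -> eta_t (t starts at 1).
    Returns the list of points played.
    """
    x = x0
    played = []
    for t, grad in enumerate(grads, start=1):
        played.append(x)
        x = project(x - step_size(t) * grad(x), lo, hi)
    return played

def sqrt_step(t, D=2.0, G=4.0):
    """Schedule eta_t = D / (G sqrt(t)) for general convex losses (assumed constants:
    D is the diameter of [-1, 1], G bounds |f_t'| on it)."""
    return D / (G * math.sqrt(t))

def strongly_convex_step(t, H=2.0):
    """Schedule eta_t = 1 / (H t) for H-strongly convex losses (here f_t'' = 2)."""
    return 1.0 / (H * t)

def regret(played, targets):
    """Regret against the best fixed point for losses f_t(x) = (x - z_t)^2."""
    best = sum(targets) / len(targets)  # minimizer of the cumulative loss
    alg = sum((x - z) ** 2 for x, z in zip(played, targets))
    opt = sum((best - z) ** 2 for z in targets)
    return alg - opt

# Hypothetical data: a deterministic sequence of targets in [-1, 1].
T = 200
targets = [math.sin(0.7 * t) for t in range(1, T + 1)]
grads = [lambda x, z=z: 2.0 * (x - z) for z in targets]

regret_sqrt = regret(online_gradient_descent(grads, sqrt_step), targets)
regret_log = regret(online_gradient_descent(grads, strongly_convex_step), targets)
print(regret_sqrt, regret_log)
```

On these strongly convex losses the $1/(Ht)$ schedule reduces to playing the running mean of past targets, which is why its regret grows only logarithmically in this toy setting.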

Online Discovery of Similarity Mappings (with Jacob Abernethy and Peter Bartlett), ICML 2007.
Abstract: We consider the problem of choosing, sequentially, a map which assigns elements of a set A to a few elements of a set B. On each round, the algorithm suffers some cost associated with the chosen assignment, and the goal is to minimize the cumulative loss of these choices relative to the best map on the entire sequence. Even though the offline problem of finding the best map is provably hard, we show that there is an equivalent online approximation algorithm, Randomized Map Prediction (RMP), that is efficient and performs nearly as well. While drawing upon results from the "Online Prediction with Expert Advice" setting, we show how RMP can be utilized as an online approach to several standard batch problems. We apply RMP to online clustering as well as online feature selection and, surprisingly, RMP often outperforms the standard batch algorithms on these problems.

Multitask Learning with Expert Advice (with Jacob Abernethy and Peter Bartlett), COLT 2007. Technical report version available here.
Abstract: We consider the problem of prediction with expert advice in the setting where a forecaster is presented with several online prediction tasks. Instead of competing against the best expert separately on each task, we assume the tasks are related, and thus we expect that a few experts will perform well on the entire set of tasks. That is, our forecaster would like, on each task, to compete against the best expert chosen from a small set of experts. While we describe the "ideal" algorithm and its performance bound, we show that the computation required for this algorithm is as hard as computation of a matrix permanent. We present an efficient algorithm based on mixing priors, and prove a bound that is nearly as good for the sequential task presentation case. We also consider a harder case where the task may change arbitrarily from round to round, and we develop an efficient approximate randomized algorithm based on Markov chain Monte Carlo techniques.

Stability of K-means Clustering (with Andrea Caponnetto). NIPS, 2006.
Abstract: We phrase K-means clustering as an empirical risk minimization procedure over a class $H_K$ and explicitly calculate the covering number for this class. Next, we show that stability of K-means clustering is characterized by the geometry of $H_K$ with respect to the underlying distribution. We prove that in the case of a unique global minimizer, the clustering solution is stable with respect to complete changes of the data, while for the case of multiple minimizers, the change of $\Omega(n^{1/2})$ samples defines the transition between stability and instability. While for a finite number of minimizers this result follows from multinomial distribution estimates, the case of infinite minimizers requires more refined tools. We conclude by proving that stability of the functions in $H_K$ implies stability of the actual centers of the clusters. Since stability is often used for selecting the number of clusters in practice, we hope that our analysis serves as a starting point for finding theoretically grounded recipes for the choice of K.
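To make the empirical-risk-minimization view in the abstract above concrete, the sketch below evaluates the quantity that K-means minimizes: the average, over the sample, of the squared distance from each point to its nearest center. The function name and the toy data are hypothetical, and the assumption that each function in $H_K$ has the form $z \mapsto \min_i \|z - a_i\|^2$ is the usual formulation rather than something stated in the abstract.

```python
def kmeans_empirical_risk(points, centers):
    """Average squared Euclidean distance from each point to its closest center.

    points, centers: sequences of equal-length numeric tuples.
    """
    total = 0.0
    for p in points:
        # Loss of one sample: squared distance to the nearest of the K centers.
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total / len(points)

# Hypothetical toy sample with two well-separated groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]
one_center = [(5.0, 1.0)]

print(kmeans_empirical_risk(points, two_centers))
print(kmeans_empirical_risk(points, one_center))
```

Minimizing this quantity over all choices of K centers is the ERM problem; the sketch only evaluates the risk for given centers and does not perform the minimization.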

Stability Properties of Empirical Risk Minimization over Donsker Classes (with Andrea Caponnetto). Journal of Machine Learning Research. Vol. 7 (Dec), 2565--2583, 2006.
(Older version as a technical report: AI Memo 2005-018. May 2005)
Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number $n$ of samples grows, the $L_2$-diameter of the set of almost-minimizers of empirical error with tolerance $\xi(n)=o(n^{-\frac{1}{2}})$ converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as $n$ increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than $n^{-1/2}$.

Risk Bounds for Mixture Density Estimation (with Dmitry Panchenko and Sayan Mukherjee). ESAIM Probability and Statistics. Vol. 9, 220-229, June 2005.
(Older version as a technical report: AI Memo 2004-001. Jan 2004)
Abstract: In this paper we focus on the problem of estimating a bounded density using a finite combination of densities from a given class. We consider the Maximum Likelihood Estimator (MLE) and the greedy procedure described by Li and Barron (Li and Barron, 1999; Li, 1999) under the additional assumption of boundedness of densities. We prove an $O(\frac{1}{\sqrt{n}})$ bound on the estimation error which does not depend on the number of densities in the estimated combination. Under the boundedness assumption, this improves the bound of Li and Barron by removing the $\log n$ factor and also generalizes it to the base classes with converging Dudley integral.

Stability Results in Learning Theory (with Sayan Mukherjee and Tomaso Poggio). Analysis and Applications, Special Issue on Learning Theory. Vol. 3, No. 4, 397-419. October 2005.
Abstract: The problem of proving generalization bounds for the performance of learning algorithms can be formulated as a problem of bounding the bias and variance of estimators of the expected error. We show how various stability assumptions can be employed for this purpose. We provide a necessary and sufficient stability condition for bounding the bias and variance for the Empirical Risk Minimization algorithm, and various sufficient conditions for bounding bias and variance of estimators for general algorithms. We discuss settings in which it is possible to obtain exponential bounds, and we prove an extension of the bounded-difference inequality for "almost always" stable algorithms.

On Stability and Concentration of Measure (with Sayan Mukherjee and Tomaso Poggio). CBCL Paper, Massachusetts Institute of Technology, Cambridge, MA, June 2004.

B (with Poggio, T., S. Mukherjee, R. Rifkin and A. Verri). In: Uncertainty in Geometric Computations, J. Winkler and M. Niranjan (eds.), Kluwer Academic Publishers, 131-141, 2002.
Abstract: In this chapter we discuss the role of $b$, which is the constant in the standard form of the solutions provided by the Support Vector Machine technique, $f(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b$, which is a special case of Regularization Machines. In the process, we describe properties of Reproducing Kernel Hilbert Spaces induced by different classes of kernels.
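For readers who want to see where the constant $b$ enters, here is a minimal sketch of evaluating a kernel expansion plus offset of the form described in the abstract above. The Gaussian kernel, the function names, and the example coefficients are assumptions for illustration and are not taken from the chapter.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; one possible choice of K, assumed for this example."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def decision_function(x, train_points, alphas, b, kernel=gaussian_kernel):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) + b."""
    return sum(a * kernel(x, xi) for a, xi in zip(alphas, train_points)) + b

# Hypothetical coefficients and training points.
train_points = [(0.0, 0.0), (1.0, 1.0)]
alphas = [2.0, -1.0]
b = 0.5

print(decision_function((0.0, 0.0), train_points, alphas, b))
print(decision_function((100.0, 100.0), train_points, alphas, b))
```

With a kernel that decays at infinity, such as the Gaussian one assumed here, the value of $f$ far from all training points is approximately $b$, so the offset sets the default output of the classifier.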

Bagging Regularizes (with T. Poggio, R. Rifkin, S. Mukherjee). AI Memo 2002-003, Massachusetts Institute of Technology, Cambridge, MA, February 2002.
Abstract: Intuitively, we expect that averaging -- or bagging -- different regressors with low correlation should smooth their behavior and be somewhat similar to regularization. In this note we make this intuition precise. Using an almost classical definition of stability, we prove that a certain form of averaging provides generalization bounds with a rate of convergence of the same order as Tikhonov regularization -- similar to fashionable RKHS-based learning algorithms.

Extra-label Information: Experiments with View-based Classification (with G. Yeo and T. Poggio). Proceedings of the Sixth International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES'2002), Podere d'Ombriano, Crema, Italy, September 16-18, 2002.
Abstract: Extra information is often readily available but not utilized in a classification paradigm. Here we explore using extra labels (profile faces and rotated faces) to aid in distinguishing faces from non-faces. We propose a way to combine simple discriminant classifiers to build more complex ones and justify the combination in a probabilistic setting.