

Probability Theory

J.M. Steele

Department of Statistics, Wharton School, University of Pennsylvania

Probability theory is a branch of mathematics that has evolved from the investigation of social, behavioral, and physical phenomena that are influenced by randomness and uncertainty. For much of its early life, probability theory dealt almost exclusively with gambling games, and, even though important contributions were made by such distinguished mathematicians as Pierre de Fermat, Blaise Pascal, and Pierre-Simon de Laplace, the field lacked respectability and failed to attract sustained attention.

One cause of the slow development of probability theory in its early days was the lack of a widely accepted foundation. Unlike the geometry of Euclid, or even the analytical investigations of Newton and Leibniz, the theory of probability seemed to be eternally tied to the modelling process. In this respect, probability theory had greater kinship with the theories of heat or elasticity than with the pristine worlds of geometry or algebra.

Int. Encyc. Social and Behavioral Sciences 18 March 2003

Over time, foundations for probability were proposed by a number of deep-thinking individuals including von Mises, de Finetti, and Keynes, but the approach that has come to be the most widely accepted is the one that was advanced in 1933 in the brief book Foundations of Probability Theory by Andrey Nikolayevich Kolmogorov.

Kolmogorov's approach to the foundations of probability theory developed naturally from the theory of integration that was introduced by Henri Lebesgue and others during the first two decades of the twentieth century. By leaning on the newly developed theory of integration, Kolmogorov demonstrated that probability theory could be viewed simply as another branch of mathematics. After Kolmogorov's work, probability theory had the same relationship to its applications that one finds for the theory of differential equations. As a consequence, the stage was set for a long and productive mathematical development.

To be sure, there are some philosophical and behavioral issues that are not well addressed by Kolmogorov's bare-bones foundations, but, over the years, Kolmogorov's approach has been found to be adequate for most purposes. The Kolmogorov axioms are remarkably succinct, yet they have all the power that is needed to capture the physical, social, and behavioral intuition that a practical theory of probability must address.


1 Kolmogorov's Axiomatic Foundations

Central to Kolmogorov's foundation for probability theory was his introduction of a triple $(\Omega, \mathcal{F}, P)$ that is now called a probability space. The triple's first element, the sample space $\Omega$, is only required to be a set, and, on the intuitive level, one can think of $\Omega$ as the set of all possible outcomes of an experiment. For example, in an experiment where one rolls a traditional six-faced die, one can take $\Omega = \{1, 2, 3, 4, 5, 6\}$.

The second element of Kolmogorov's probability space is a collection $\mathcal{F}$ of subsets of $\Omega$ that satisfies three basic consistency properties that will be described shortly. On the intuitive level, one can think of the elements of $\mathcal{F}$ as "events" that may occur as a consequence of the experiment described by $(\Omega, \mathcal{F}, P)$. Thus, to continue with the example of rolling a die, the set $A = \{1, 3, 5\} \subset \Omega$ would correspond to the event that one rolls an odd number.

Two of the three consistency properties that Kolmogorov imposes on $\mathcal{F}$ are quite trivial. First, $\mathcal{F}$ is required to contain $\Omega$. Second, $\mathcal{F}$ must be closed under complementation; so, for example, if $A \in \mathcal{F}$ then $A^c \in \mathcal{F}$, where $A^c = \{\omega : \omega \in \Omega \text{ and } \omega \notin A\}$. The third condition is only a bit more elaborate; the collection $\mathcal{F}$ must be closed under countable unions. Thus, if $A_k \in \mathcal{F}$ for $k = 1, 2, \ldots$, then one requires that the union $A_1 \cup A_2 \cup \cdots$ of all of the events in the countable set $\{A_k : 1 \le k < \infty\}$ must again be an element of $\mathcal{F}$.

The most interesting element of Kolmogorov's triple $(\Omega, \mathcal{F}, P)$ is the probability measure $P$. Formally, $P$ is just a function that assigns a real number to each of the elements of $\mathcal{F}$, and, naturally enough, one thinks of $P(A)$ as the probability of the event $A$. Thus, one can specify a probability model for the outcome of rolling a fair die by defining $P$ for $A \subset \Omega$ by $P(A) = \frac{1}{6}|A|$, where $|A|$ denotes the number of elements of the set $A$.
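For the die example, the whole probability space is small enough to write down and check by machine. The following Python sketch is only an illustration, under the assumption that the collection of events is taken to be all 64 subsets of the sample space; the names `OMEGA`, `F`, `P`, and `power_set` are chosen here and are not part of the original text.

```python
from fractions import Fraction
from itertools import combinations

# Sample space for one roll of a six-faced die.
OMEGA = frozenset(range(1, 7))

def power_set(s):
    """Return all subsets of s as a set of frozensets."""
    items = list(s)
    return {frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)}

# Assumption: the collection of events is every subset of OMEGA.
F = power_set(OMEGA)

def P(A):
    """Fair-die probability measure: P(A) = |A| / 6."""
    return Fraction(len(A), len(OMEGA))

# The three consistency properties. Because F is finite, closure under
# countable unions reduces to closure under pairwise unions.
contains_omega = OMEGA in F
closed_under_complement = all((OMEGA - A) in F for A in F)
closed_under_union = all((A | B) in F for A in F for B in F)

odd = frozenset({1, 3, 5})
print(len(F), contains_omega, closed_under_complement, closed_under_union)
print(P(odd))  # probability of rolling an odd number: 1/2
```

Exact rational arithmetic (`Fraction`) is used so that the identities can be checked with equality rather than with floating-point tolerances.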

For sample spaces $\Omega$ that are not finite, more care is required in the specification of the probability measure $P$. To deal with general $\Omega$, Kolmogorov restricted his attention to those $P$ that satisfy three basic axioms:

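Axiom 1. For all $A \in \mathcal{F}$, one has $P(A) \ge 0$.

Axiom 2. $P(\Omega) = 1$.

Axiom 3. For any countable collection $\{A_k \in \mathcal{F} : 1 \le k < \infty\}$ with the property that $A_j \cap A_k = \emptyset$ for all $j \ne k$, one has

$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i).$$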

Axioms 1 and 2 just reflect the intuitive view that $P(A)$ measures the frequency with which $A$ occurs in an imagined sequence of repetitions of the experiment $(\Omega, \mathcal{F}, P)$. For most mathematicians, Axiom 3 also just reflects the most basic intuition about the way probabilities should behave. Nevertheless, Axiom 3 is more subtle because it deals with an infinite collection of sets, and, in some ways, such collections are outside of our direct experience. This has led some philosophers to examine the possibility of avoiding Kolmogorov's third axiom, and, over the years, various attempts have been made to replace Kolmogorov's third axiom with the simpler assumption of finite additivity.
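To see what the third axiom asks for when the collection of events is genuinely infinite, one can look at a simple countable model. The sketch below assumes a hypothetical sample space of outcomes 1, 2, 3, ... with point masses $2^{-k}$ (an example chosen here, not taken from the text) and shows the partial sums over the disjoint events $A_k = \{k\}$ increasing toward the probability of their union, which is the whole space.

```python
from fractions import Fraction

def p_point(k):
    """Hypothetical point mass P({k}) = 2**-k on the outcomes k = 1, 2, ..."""
    return Fraction(1, 2 ** k)

def partial_sum(n):
    """Sum of P(A_k) over the disjoint events A_k = {k}, k = 1..n."""
    return sum(p_point(k) for k in range(1, n + 1))

# The partial sums equal 1 - 2**-n, so they increase to 1, the
# probability that Axiom 2 assigns to the union of all the A_k.
for n in (1, 5, 10, 20):
    print(n, partial_sum(n), float(1 - partial_sum(n)))
```

Finite additivity alone would only control each partial sum; the countable version of the axiom is what identifies their limit with the probability of the infinite union.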

2 Random Variables

In most applications of probability theory, the triple $(\Omega, \mathcal{F}, P)$ that gives life to probability as a rigorous mathematical subject is almost invisible. In practice, builders of probability models take advantage of various shortcuts that have the effect of keeping the probability triple at a respectful distance. The most important of these shortcuts is the notion of a random variable.

On the intuitive level, a random variable is just a number that depends on a chance-driven experiment. More formally, a random variable $X$ is a function from $\Omega$ to the real numbers with the property that $\{\omega : X(\omega) \le t\} \in \mathcal{F}$ for all $t$. What drives this definition is that one inevitably wants to talk about the probability that $X$ is at most $t$, and, for such talk to make sense under Kolmogorov's framework, the set $\{\omega : X(\omega) \le t\}$ must be an element of $\mathcal{F}$. Random variables make the modeler's job easier by providing a language that relates directly to the entities that are of interest in a probability model.
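The requirement that the sets $\{\omega : X(\omega) \le t\}$ be events has real content only when the collection of events is smaller than the collection of all subsets. As a rough illustration, the Python sketch below assumes a hypothetical collection `F_PARITY` that records only whether a die roll is odd or even; the function names are likewise chosen here for the example.

```python
OMEGA = frozenset(range(1, 7))
ODD, EVEN = frozenset({1, 3, 5}), frozenset({2, 4, 6})

# Hypothetical collection of events that records only the parity of the roll.
F_PARITY = {frozenset(), ODD, EVEN, OMEGA}

def is_random_variable(X, events):
    """Check that {w : X(w) <= t} belongs to `events` for every real t.

    On a finite sample space it suffices to test t at the values X takes,
    since the set does not change between consecutive values; any t below
    all of the values yields the empty set.
    """
    thresholds = sorted({X(w) for w in OMEGA})
    level_sets = [frozenset()] + [
        frozenset(w for w in OMEGA if X(w) <= t) for t in thresholds
    ]
    return all(A in events for A in level_sets)

def parity(w):
    """Y(w) = 1 for odd rolls and 0 for even rolls."""
    return w % 2

def face_value(w):
    """X(w) = the number shown on the die."""
    return w

print(is_random_variable(parity, F_PARITY))      # True
print(is_random_variable(face_value, F_PARITY))  # False
```

Under this coarse collection of events the parity of the roll is a random variable, but the face value is not, because a set such as $\{1\}$ is not an event.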

3 The Distribution Function and Related Quantities

There are several ways to specify the basic probabilistic properties of a random variable, but the most fundamental description is given by the distribution function of $X$, which is defined by $F(t) = P(X \le t)$. Knowledge of the distribution function tells one everything that there is to know about the probability theory of the random variable. Sometimes it even tells too much.

The knowledge one has of a random variable is often limited, and in such cases it may be useful to focus on just part of the information that would be provided by the distribution function. One common way to specify such partial information is to use the median or the quantiles of the random variable $X$. The median is defined to be a number $m$ for which one has $P(X < m) \le 1/2 \le P(X \le m)$. Roughly speaking, the median $m$ splits the set of outcomes of $X$ into two halves so that the top half and the bottom half each have probability that is close to one-half.

The quantile $x_p$ does a similar job. It splits the possible values of $X$ into disjoint sets, so that one has $P(X < x_p) \le p \le P(X \le x_p)$. When $p = 1/2$, the quantile $x_p$ reduces to the median, and, when $p = 1/4$ or $p = 3/4$, then $x_p$ is called the lower quartile or the upper quartile, respectively.
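The defining inequalities for the median and the quantiles can be checked directly for the fair die. The sketch below is an illustration only, with helper names (`F`, `P_less`, `is_quantile`) chosen here; it also shows that a median need not be unique.

```python
from fractions import Fraction

FACES = range(1, 7)  # outcomes of a fair die, each with probability 1/6

def F(t):
    """Distribution function F(t) = P(X <= t)."""
    return Fraction(sum(1 for x in FACES if x <= t), 6)

def P_less(t):
    """P(X < t)."""
    return Fraction(sum(1 for x in FACES if x < t), 6)

def is_quantile(x, p):
    """True when P(X < x) <= p <= P(X <= x)."""
    return P_less(x) <= p <= F(x)

half = Fraction(1, 2)
print([x for x in FACES if is_quantile(x, half)])            # [3, 4]
print(is_quantile(3.5, half))                                # True
print([x for x in FACES if is_quantile(x, Fraction(1, 4))])  # [2]
print([x for x in FACES if is_quantile(x, Fraction(3, 4))])  # [5]
```

For the die, every number between 3 and 4 satisfies the definition of the median, while the lower and upper quartiles are pinned down as 2 and 5.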

4 Mass Functions and Densities

If a random variable $X$ only takes values from a finite or countable set $S$, then $X$ is called a discrete random variable. In such cases, one can also give a complete description of the probabilistic properties of $X$ by specification of the probability mass function, which is defined by setting $f(x) = P(X = x)$ for all $x \in S$. Here, one should note that the probability mass function $f$ permits one to recapture the distribution function by the relation

$$F(t) = \sum_{x \le t} f(x) \quad \text{for all } t.$$
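The passage from a mass function to a distribution function is just a cumulative sum. As a small sketch, assume a hypothetical discrete random variable that counts the heads in three fair coin tosses (an example chosen here, not taken from the text):

```python
from fractions import Fraction
from itertools import accumulate

# Hypothetical example: X = number of heads in three fair coin tosses.
support = [0, 1, 2, 3]
f = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def F(t):
    """F(t) = sum of f(x) over all x in the support with x <= t."""
    return sum((f[x] for x in support if x <= t), Fraction(0))

# Cumulative sums give F at the support points in one pass.
cumulative = list(accumulate(f[x] for x in support))
print(cumulative)
print(F(1.5), F(-1), F(10))
```

Between support points the distribution function is flat, so `F(1.5)` agrees with `F(1)`.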

There are many important random variables whose values are not confined to any countable set, but, for some of these random variables, there is still a description of their distribution functions that is roughly analogous to that provided for discrete random variables. A random variable $X$ is said to be absolutely continuous provided that there exists a function $f$ such that the distribution function of $X$ has a representation of the form

$$F(t) = \int_{-\infty}^{t} f(x)\,dx \quad \text{for all } t.$$

The function f in this representation is called the density of X, and many random variables of practical importance have densities. The most famous of these are the standard normal (or, standard Gaussian) random variables that have the density

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \quad \text{for all } -\infty < x < \infty,$$

which has a key role in the Central Limit Theorem that will be discussed shortly.
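The relation between a density and its distribution function can be checked numerically for the standard normal case. The sketch below is only an approximation scheme chosen for illustration: it integrates the density with the midpoint rule, using $-10$ as a stand-in for $-\infty$, and compares the result with the closed form available through the error function in Python's `math` module.

```python
import math

def density(x):
    """Standard normal density f(x) = exp(-x**2 / 2) / sqrt(2*pi)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def F_numeric(t, lower=-10.0, steps=20000):
    """Approximate F(t), the integral of f up to t, by the midpoint rule.

    Assumption: the lower limit -10 stands in for -infinity; the
    probability below it is negligible for this density.
    """
    if t <= lower:
        return 0.0
    h = (t - lower) / steps
    return h * sum(density(lower + (i + 0.5) * h) for i in range(steps))

def F_exact(t):
    """The same distribution function written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

for t in (-1.0, 0.0, 1.96):
    print(t, round(F_numeric(t), 6), round(F_exact(t), 6))
```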

Much of the theory of probability can be developed with just the notion of discrete or absolutely continuous random variables, but there are natural random variables that do not fit into either class. For this reason, the distribution function remains the fundamental tool for describing the probabilities associated with a random variable; it alone is universal.

5 Mathematical Expectation: A Fundamental Concept

If X is a discrete random variable, then its mathematical expectation (or just expectation, for short) is defined by

$$E(X) = \sum_{x \in S} x f(x),$$

where $f$ is the probability mass function of $X$. To illustrate this definition, one can take $X$ to be the outcome of rolling a fair die, so that $f(x) = 1/6$ for all $x \in S = \{1, 2, \ldots, 6\}$ and

$$E(X) = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \cdots + 6 \cdot \frac{1}{6} = \frac{7}{2}.$$
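The die calculation is a one-line sum, and it can be reproduced exactly with rational arithmetic. In this short sketch the function name `expectation` and the dictionary representation of the mass function are choices made here for illustration.

```python
from fractions import Fraction

def expectation(f):
    """E(X) = sum of x * f(x) over the support of the mass function f."""
    return sum(x * p for x, p in f.items())

fair_die = {x: Fraction(1, 6) for x in range(1, 7)}
print(expectation(fair_die))  # 7/2
```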

In a way that is parallel, yet not perfectly so, the expectation of an absolutely continuous random variable $X$ is defined by the integral

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx.$$

Here one needs to work a bit harder to illustrate the definition. Consider, for example, an experiment where one chooses a number at random out of the unit interval $[0, 1]$. From the intuitive meaning of the experiment, one has $P(A) = a$ for $A = [0, a]$, and, from this relationship, one can calculate that the density function for $X$ is given by $f(x) = 1$ for $x \in [0, 1]$ and by $f(x) = 0$ otherwise. For the density $f$, one therefore finds that

$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{0}^{1} x \cdot 1\,dx = \frac{1}{2}.$$



From this formula, one sees that the expected value of a number chosen at random from the unit interval is equal to one-half, and for many people this assertion is perfectly reasonable. Still, one should note that the probability that $X$ equals one-half is in fact equal to zero, so the probabilistic use of the term "expected value" differs modestly from the day-to-day usage.
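Both observations, that the expectation is one-half and that the value one-half itself is essentially never observed, can be illustrated numerically. The sketch below assumes Python's pseudo-random generator as a stand-in for "choosing a number at random" and uses a midpoint rule for the integral; the seed and sample size are arbitrary.

```python
import random

def density(x):
    """Density of a number chosen at random from [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def expectation(f, lower, upper, steps=100000):
    """Midpoint-rule approximation of the integral of x * f(x)."""
    h = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * h
        total += x * f(x)
    return h * total

rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws = [rng.random() for _ in range(100000)]

integral = expectation(density, 0.0, 1.0)
sample_mean = sum(draws) / len(draws)
hits = sum(1 for x in draws if x == 0.5)  # draws exactly equal to 1/2
print(integral, sample_mean, hits)
```

The sample mean lands close to one-half even though one should not expect any individual draw to equal one-half.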

The probability distribution function and the expectation operation provide almost all of the language that is needed to describe the probability theory of an individual random variable. To be sure, there are several further notions of importance, but these may all be expressed in terms of the distribution or the expectation. For example, the most basic measure of dispersion for the random variable $X$ is its variance, which is defined in terms of the expectation by the formula

$$\mathrm{Var}(X) = E(X - \mu)^2 \quad \text{where } \mu = E(X).$$

Finally, the standard deviation of X, which is defined to be the square root of Var(X), provides a useful measure of scale for problems involving X.
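For the fair die, the variance and standard deviation follow from the same kind of sum used for the expectation. The sketch below is an illustration with names chosen here; it computes the variance exactly and the standard deviation in floating point.

```python
from fractions import Fraction
import math

fair_die = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(f, g=lambda x: x):
    """E(g(X)) for a discrete X with mass function f."""
    return sum(g(x) * p for x, p in f.items())

mu = expectation(fair_die)
variance = expectation(fair_die, lambda x: (x - mu) ** 2)
std_dev = math.sqrt(variance)
print(mu, variance, std_dev)
```

The exact variance is 35/12, so the standard deviation is a little more than 1.7 faces.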

6 Introducing Independence

The world of probability theory does not stop with the description of a single random variable. In fact, it becomes rich and useful only when one considers collections of random variables, especially collections of independent random variables.

Two events $A$ and $B$ in $\mathcal{F}$ are said to be independent provided that they satisfy the identity

$$P(A \cap B) = P(A)P(B).$$

What one hopes to capture with this definition is the notion that the occurrence of $A$ has no influence on the occurrence of $B$, and vice versa. Nevertheless, the quantities that appear in the defining formula of independence are purely mathematical constructs, and any proof of independence ultimately boils down to the proof of the defining identity in a concrete model $(\Omega, \mathcal{F}, P)$.

Consider, for example, the experiment of rolling two fair dice, one red and one blue. The sample space $\Omega$ for this experiment may be taken to be the 36 pairs $(j, k)$ with $1 \le j \le 6$ and $1 \le k \le 6$, where one thinks of $j$ and $k$ as the number rolled on the red and blue die, respectively. Here, for any $A \subset \Omega$ one can set $P(A) = |A| / |\Omega|$ in order to obtain a model for a pair of fair dice. Under this probability model, there are many pairs of events that one can prove to be independent. In particular, one can prove that the event of rolling an even number on the blue die is independent of rolling an odd number on the red die. More instructively, one can also prove that the event of rolling an even number on the blue die is independent of the parity of the sum of the two dice.
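In a model this small, the "proof" of independence is a finite check of the defining identity. The following sketch enumerates the 36 outcomes and tests the two claims about the dice; the third event, a sum of at least 10, is an extra example added here to show a pair of events that fails the identity.

```python
from fractions import Fraction
from itertools import product

# Sample space: pairs (j, k) = (red die, blue die).
OMEGA = frozenset(product(range(1, 7), repeat=2))

def P(A):
    """Uniform probability P(A) = |A| / |OMEGA|."""
    return Fraction(len(A), len(OMEGA))

def independent(A, B):
    """Check the defining identity P(A and B) = P(A) P(B)."""
    return P(A & B) == P(A) * P(B)

blue_even = frozenset(w for w in OMEGA if w[1] % 2 == 0)
red_odd = frozenset(w for w in OMEGA if w[0] % 2 == 1)
sum_even = frozenset(w for w in OMEGA if (w[0] + w[1]) % 2 == 0)
sum_at_least_10 = frozenset(w for w in OMEGA if w[0] + w[1] >= 10)

print(independent(blue_even, red_odd))          # True
print(independent(blue_even, sum_even))         # True
print(independent(blue_even, sum_at_least_10))  # False
```

The second result is the instructive one: the sum depends on the blue die, yet its parity carries no information about the parity of the blue die alone.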

7 Extending Independence

The concept of independence can be extended to random variables by defining $X$ and $Y$ to be independent provided that the events $\{X \le s\}$ and $\{Y \le t\}$ are independent for all real $s$ and $t$. One easy consequence of this definition is that for any pair of monotone functions $\phi$ and $\psi$ the random variables $\phi(X)$ and $\psi(Y)$ are independent whenever $X$ and $Y$ are independent. This mapping property reconciles nicely with the intuitive assertion: if $X$ and $Y$ have no influence on each other, then neither should $\phi(X)$ and $\psi(Y)$ have any influence on each other.

The importance of independence for probability theory will be underscored by the theorems of the next section, but first the notion of independence must be extended to cover more than just pairs of random variables. For a finite collection of $n$ random variables $X_1, X_2, \ldots, X_n$, the condition for independence is simply that

$$P(X_1 \le t_1, X_2 \le t_2, \ldots, X_n \le t_n) = P(X_1 \le t_1)\, P(X_2 \le t_2) \cdots P(X_n \le t_n)$$

for all real values $t_1, t_2, \ldots, t_n$. Finally, an infinite collection of random variables $\{X_s : s \in S\}$ is said to be independent provided that every finite subset of the collection is independent.
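For random variables with finitely many values, the product condition can be verified exhaustively. The sketch below works in a model of two fair dice with 36 equally likely outcomes and checks a pair of variables; the helper names and the choice of test variables are assumptions made here for illustration.

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product(range(1, 7), repeat=2))  # (red, blue), equally likely

def P(event):
    """Probability of the set of outcomes on which `event` is true."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))

def independent(X, Y):
    """Check P(X <= s, Y <= t) = P(X <= s) P(Y <= t) at every pair of values.

    For variables with finitely many values, testing s and t at those
    values covers all real s and t.
    """
    xs = sorted({X(w) for w in OMEGA})
    ys = sorted({Y(w) for w in OMEGA})
    return all(
        P(lambda w: X(w) <= s and Y(w) <= t)
        == P(lambda w: X(w) <= s) * P(lambda w: Y(w) <= t)
        for s in xs for t in ys
    )

def red(w):
    return w[0]

def blue(w):
    return w[1]

def total(w):
    return w[0] + w[1]

print(independent(red, blue))    # True
print(independent(red, total))   # False
# Monotone functions of independent variables remain independent.
print(independent(lambda w: red(w) ** 3, lambda w: -blue(w)))  # True
```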

8 The Law of Large Numbers

Any mathematical theory that hopes to reflect real-world random phenomena must provide some rigorous interpretation of the intuitive "law of averages." Kolmogorov's theory provides many such results, the most important of which is given in the following theorem.

Theorem 1 (Law of Large Numbers). Suppose that $\{X_i : 1 \le i < \infty\}$ is a sequence of independent random variables with a common distribution function $F(\cdot)$, so $P(X_i \le t) = F(t)$ for all $1 \le i < \infty$ and all real $t$. If the expectation $\mu = E(X_1)$ is well-defined and finite, then the random variables

$$\frac{1}{n}\{X_1 + X_2 + \cdots + X_n\}$$

converge to their common mean $\mu$ with probability one. More precisely, if one lets

$$A = \Big\{\omega : \lim_{n \to \infty} \frac{1}{n}\{X_1(\omega) + X_2(\omega) + \cdots + X_n(\omega)\} = \mu\Big\},$$

then the event $A$ satisfies $P(A) = 1$.
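A simulation cannot prove the theorem, but it does show the behavior the theorem describes along one sample path. The sketch below assumes independent rolls of a fair die, whose common mean is 7/2, generated with Python's pseudo-random generator; the seed and the checkpoints are arbitrary choices.

```python
import random

def running_means(n, rng):
    """Yield (k, average of the first k rolls) for a sequence of fair-die rolls."""
    total = 0
    for k in range(1, n + 1):
        total += rng.randint(1, 6)
        yield k, total / k

rng = random.Random(2003)  # arbitrary fixed seed for reproducibility
checkpoints = {10, 100, 1000, 10000, 100000}
means = {k: m for k, m in running_means(100000, rng) if k in checkpoints}
for k in sorted(means):
    print(k, means[k])
```

Along this one path the running averages settle near 3.5, which is the kind of stabilization that the event $A$ in the theorem makes precise.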

9 The Central Limit Theorem

The second great theorem of probability theory is the famous Central Limit Theorem. Although it is not tied as tightly to the meaning of probability as the Law of Large Numbers, the Central Limit Theorem is key to many of the practical applications of probability theory. In particular, it often provides a theoretical interpretation for the bell curve that emerges in countless empirical investigations.

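Theorem 2 (Central Limit Theorem). Suppose that $\{X_i : 1 \le i < \infty\}$ is a sequence of independent random variables with a common distribution function $F$. If these random variables have mean $\mu = E(X_i)$ and a finite variance $\mathrm{Var}(X_i) = \sigma^2$, then

$$\lim_{n \to \infty} P\Big(\frac{1}{\sigma\sqrt{n}}\{X_1 + X_2 + \cdots + X_n - n\mu\} \le x\Big) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du$$

for all real $x$.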
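The bell curve can be seen emerging in a small experiment. The sketch below assumes fair-die rolls (mean 7/2, variance 35/12), standardizes sums of 50 rolls, and compares the empirical distribution of 20,000 such sums with the standard normal distribution function; all of these sizes and the seed are arbitrary choices made for illustration.

```python
import math
import random

MU = 3.5                      # mean of a fair die
SIGMA = math.sqrt(35 / 12)    # standard deviation of a fair die

def standardized_sum(n, rng):
    """(X_1 + ... + X_n - n*mu) / (sigma * sqrt(n)) for n fair-die rolls."""
    total = sum(rng.randint(1, 6) for _ in range(n))
    return (total - n * MU) / (SIGMA * math.sqrt(n))

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = random.Random(2003)  # arbitrary fixed seed
samples = [standardized_sum(50, rng) for _ in range(20000)]

def empirical(x):
    """Fraction of simulated standardized sums that are at most x."""
    return sum(1 for z in samples if z <= x) / len(samples)

for x in (-1.0, 0.0, 1.0):
    print(x, round(empirical(x), 3), round(Phi(x), 3))
```

The agreement is only approximate for a finite number of rolls, in part because a sum of die rolls takes integer values while the normal limit is continuous.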

10 Stochastic Processes

The most fundamental results of probability theory address the behavior of sums of independent random variables, but many applications of probability theory lead to sequences of random variables $\{X_n : 1 \le n < \infty\}$ that may not be independent. Such sequences are called stochastic processes, and they are of many different types.

The simplest nontrivial stochastic process is the Markov chain, which is used to model random phenomena where $X_{n+1}$ depends on $X_n$, but, given $X_n$, the value of $X_{n+1}$ does not depend on the rest of the past $X_{n-1}, X_{n-2}, \ldots, X_1$. To construct such a process, one most often begins with an $m \times m$ matrix $T = \{p_{ij}\}$ with entries that satisfy $0 \le p_{ij} \le 1$ and with row sums that satisfy $p_{i1} + p_{i2} + \cdots + p_{im} = 1$. One then generates the Markov chain by making sequential selections from the set $S = \{1, 2, \ldots, m\}$ in accordance with the rows of the transition matrix $T$. Specifically, if $X_n = i$, then $X_{n+1}$ is obtained by choosing an element of $S$ in accordance with the probabilities $(p_{ij})$ given by the $i$th row of $T$.
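The sequential selection procedure translates almost line for line into code. The sketch below assumes a hypothetical three-state transition matrix chosen only for illustration; the function name `simulate_chain`, the starting state, and the seed are likewise assumptions.

```python
import random

# Hypothetical 3-state transition matrix; each row sums to 1.
T = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 0.5, 0.5],
]

def simulate_chain(T, start, steps, rng):
    """Return a path of length `steps` with states labelled 1, ..., len(T)."""
    states = list(range(1, len(T) + 1))
    path = [start]
    for _ in range(steps - 1):
        row = T[path[-1] - 1]  # row of T for the current state
        # Choose the next state with the probabilities in that row.
        path.append(rng.choices(states, weights=row)[0])
    return path

rng = random.Random(0)  # fixed seed so the sketch is reproducible
path = simulate_chain(T, start=1, steps=20, rng=rng)
print(path)
```

Each new state is drawn using only the row for the current state, which is exactly the sense in which the rest of the past is ignored.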

A second group of stochastic processes that is of considerable practical importance is the set of martingales. Roughly speaking, these are stochastic processes that have the property that the expectation of $X_{n+1}$ given the values of $X_n, X_{n-1}, \ldots, X_1$ is just equal to $X_n$. One reason that martingales are important is that they provide a model for the fortune of an individual who gambles on a fair game. Such gambling games are relatively unimportant in themselves, but many economic and financial questions can be reframed as such games. As a consequence, the theory of martingales has become an essential tool in the pricing of stock options and other derivative securities.
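The martingale property of a fair game can be confirmed exactly in a small model. The sketch below assumes a hypothetical gambler who wins or loses one unit on each play with equal probability, and it computes the conditional expectation of the next fortune by enumerating the equally likely continuations of a given history; the function names are chosen here.

```python
from fractions import Fraction
from itertools import product

def fortunes(steps):
    """Fortune after each play for a given sequence of +1/-1 outcomes."""
    total, path = 0, []
    for s in steps:
        total += s
        path.append(total)
    return tuple(path)

def conditional_expectation_next(history, n_total):
    """E(X_{n+1} | X_1, ..., X_n = history), by enumerating equally likely paths."""
    n = len(history)
    matches = [
        fortunes(steps)
        for steps in product((+1, -1), repeat=n_total)
        if fortunes(steps)[:n] == history
    ]
    return Fraction(sum(p[n] for p in matches), len(matches))

history = fortunes((+1, +1, -1))  # fortunes 1, 2, 1 after three plays
print(history[-1], conditional_expectation_next(history, n_total=5))  # 1 1
```

Whatever the history, the expected next fortune equals the current one, since the next play adds +1 or -1 with equal weight.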

The theory of stochastic processes is not confined to just those sequences $\{X_n : 1 \le n < \infty\}$ with the discrete index set $\{1 \le n < \infty\}$, and, in fact, almost any set $S$ can serve as the index set. When one takes $S$ to be the set of nonnegative real numbers, the index is often interpreted as time, and in this case one speaks of continuous-time stochastic processes. The most important of these are the Poisson process and Brownian motion. Brownian motion is arguably the most important stochastic process.

11 Directions for Further Reading

For a generation, Feller (1968) has served as an inspiring introduction to probability theory. The text assumes only a modest background in calculus and linear algebra, yet it goes quite far. The text of Dudley (1989) is addressed to more mathematically sophisticated readers, but it contains much material that is accessible to readers at all levels. In particular, Dudley (1989) contains many informative historical notes with careful references to the original sources.

For an introduction to the theory of stochastic processes, the text by Çinlar (1975) is recommended, and, for an easy introduction to Brownian motion, martingales, and their applications in finance, one can consult Steele (2000).

For background on the early development of probability theory, the books of David (1962) and Stigler (1986) are useful, and the article by Doob (1994) helps make the link to the present. Finally, for direct contact with the master, anyone does well to read Kolmogorov (1933).


Çinlar, E. (1975). Introduction to Stochastic Processes. Prentice-Hall, Englewood Cliffs, NJ.

David, F.N. (1962). Games, Gods, and Gambling: The Origins and History of Probability from the Earliest Times to the Newtonian Era. Griffin, London.

Doob, J.L. (1994). The development of rigor in mathematical probability (1900-1950), in Development of Mathematics 1900-1950, J.-P. Pier, ed. Birkhäuser Verlag, Basel.

Dudley, R.M. (1989). Real Analysis and Probability. Wadsworth-Brooks/Cole, Pacific Grove.

Feller, W. (1968). An Introduction to Probability Theory and Its Applications. Vol. 1, 3rd Ed. Wiley, New York.

Kolmogorov, A.N. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer-Verlag, Berlin. (English translation: N. Morrison (1956), Foundations of the Theory of Probability, Chelsea, New York.)

Steele, J.M. (2000). Stochastic Calculus and Financial Applications. Springer, New York.

Stigler, S.M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press, Cambridge, MA.

J.M. Steele