Probability

\[ \newcommand{\cB}{\mathcal{B}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cX}{\mathcal{X}} \newcommand{\EE}{\mathbb{E}} \newcommand{\PP}{\mathbb{P}} \newcommand{\RR}{\mathbb{R}} \newcommand{\ZZ}{\mathbb{Z}} \newcommand{\td}{\,\textrm{d}} \newcommand{\simiid}{\stackrel{\textrm{i.i.d.}}{\sim}} \newcommand{\simind}{\stackrel{\textrm{ind.}}{\sim}} \newcommand{\eqas}{\stackrel{\textrm{a.s.}}{=}} \newcommand{\eqPas}{\stackrel{\cP\textrm{-a.s.}}{=}} \newcommand{\eqmuas}{\stackrel{\mu\textrm{-a.s.}}{=}} \newcommand{\eqD}{\stackrel{D}{=}} \newcommand{\indep}{\perp\!\!\!\!\perp} \DeclareMathOperator*{\minz}{minimize\;} \DeclareMathOperator*{\maxz}{maximize\;} \DeclareMathOperator*{\argmin}{argmin\;} \DeclareMathOperator*{\argmax}{argmax\;} \newcommand{\Var}{\textnormal{Var}} \newcommand{\Cov}{\textnormal{Cov}} \newcommand{\Corr}{\textnormal{Corr}} \]

What is a probability?

There is some philosophical controversy about what probability is. Roughly speaking, there are two common answers:

Frequentist answer: Probability represents a relative frequency of an outcome when an experiment is repeated many times.
Bayesian answer: Probability represents a degree of belief that something is true, or that something will happen.

People who give the first answer tend to be doubtful that we can meaningfully assign probabilities to everything. There is no sense in which the next presidential election is an experiment that we can repeat. Below is a list of outcomes that seem more and more difficult to assign probabilities to. Think about where you would stop agreeing that probability is a meaningful construct:

The probability that a (roughly) physically symmetrical die will roll 4 on the next toss: most people, regardless of their metaphysical views about probability, would probably agree on \(1/6\), whether it is because they can imagine the die being tossed many times, because they believe the die has an equal physical propensity to land on any given face, or because they think we should be subjectively indifferent between any of the six outcomes.
The probability that a radiation treatment will be successful for a given cancer patient: we could probably give some answer based on the fraction of patients for whom the treatment is successful, but we have to be careful about what comparison group to use: we should probably only include patients whose cancer is at a similar stage and who are roughly the same age, but perhaps we should take into account other factors like the patient’s family history, other aspects of their health profile, the cancer’s genetic profile, etc. By the time we get done conditioning on everything that might matter, the patient may be the only person left in our comparison class.
The probability that the Democratic candidate will win the next presidential election: every presidential election is a truly one-off event, and anyone whose opinion is worth listening to can probably list a few important aspects of this election that are more or less unprecedented.
The probability that a given subatomic particle has the mass predicted by a certain physical theory: this is not the probability that something will happen but a probability that something is true about the world. Whether it is true or false, it has been so since the creation of the universe.
The probability that P = NP (as a statement about computational complexity theory): whether or not this is true is a purely mathematical question, but it is one that may or may not be resolved in our lifetime.
The probability that the 20th digit of \(\sqrt{2}\) is 5: this is another mathematical question that you could probably work out for yourself in a few minutes with just a pencil and paper and no new data about the world, so it seems perverse to think of it as random. Still, if you were forced to place a bet and didn’t have time to work it out you might assign a \(10\%\) probability.

The two great statistical frameworks of frequentist and Bayesian statistics loosely correspond to the two views about probability above. Generally speaking, Bayesians get a lot of mileage out of being willing to assign probabilities to everything, while frequentists try to take a more conservative course where possible, allowing some quantities to simply be unknown.

A great deal of ink has been spilled about whether one approach is superior to the other. Fifty years ago, it was more common for statisticians to be dogmatically committed to one view or the other, but nowadays the mainstream view is that both approaches have strengths and weaknesses, and which framework to use is a pragmatic choice that should be made on a case-by-case basis.

Fortunately, the philosophical and methodological disagreements between frequentists and Bayesians are not disagreements about the mathematical construct of probability, only questions about how and when these mathematical formalisms should be connected to real-world questions. There is very little controversy about how we should conceive of probability mathematically. This leads us to our third answer:

Mathematical answer: a probability is a measure \(P\) on a sample space \(\cX\), which assigns total measure \(1\) to the sample space.

Probability as a measure

Measure theory is an area of mathematics concerned with measuring the “size” of subsets of a certain set. Soon after it was developed in the early twentieth century, the great Soviet mathematician Kolmogorov realized it could be applied to give a rigorous grounding to probability theory. This was a major advance in understanding and resolving certain paradoxes in probability theory. David Aldous gives a nice discussion of this history.

This is not a course on measure-theoretic probability and we will not rigorously develop the subject. However, it will be useful to draw on some of the basics of measure theory, because the subject clarifies the meaning of conditioning and allows us to simplify our notation; for example, the measure-theoretic idea of a density includes both probability density functions and probability mass functions, allowing us to cleanly state results that apply both to discrete and continuous random variables, as well as to discrete-continuous mixtures, random graphs, and even random variables defined on more exotic sample spaces like manifolds.

Measures

Given a set \(\cX\), a measure \(\mu\) is a certain kind of function mapping (non-pathological¹) subsets \(A \subseteq \cX\) to non-negative numbers \(\mu(A) \in [0,\infty]\).

Example 1 (Counting measure): If \(\cX\) is countable, e.g. \(\cX = \mathbb{Z}\), then a natural measure is the counting measure \(\#(A)\), which simply counts the number of points in a subset \(A\). For example, \(\#(\{0,1\}) = 2\), and \(\#(\{2,4,6,8,\ldots\}) = \infty\).

Example 2 (Lebesgue measure): If \(\cX = \RR^n\) for some integer \(n\), a natural measure is the Lebesgue measure \(\lambda(A)\), which returns the volume of a subset \(A\). Roughly speaking, we can write

\[ \lambda(A) = \int \cdots \int_A \,d x_1\,d x_2\cdots \,d x_n. \]

Example 3 (Gaussian measure): Now taking \(\cX = \RR\), we might instead want to define the “size” of a set as the probability that a standard Gaussian random variable \(Z \sim \cN(0,1)\) is observed to be in the set \(A\). That is, we can define the measure:

\[ P_Z(A) = \PP(Z \in A) = \int_A \phi(x)\,d x, \quad \text{ where } \;\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \]

is the probability density function of \(Z\).
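To make the Gaussian measure concrete, here is a minimal numerical sketch that approximates \(P_Z([a,b])\) by integrating \(\phi\) with the midpoint rule. The function names and the `steps` parameter are illustrative assumptions, not part of the notes.

```javascript
// Standard Gaussian density phi(x) = exp(-x^2 / 2) / sqrt(2 * pi).
function phi(x) {
  return Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
}

// Approximate P_Z([a, b]), the integral of phi over [a, b], with the midpoint rule.
// `steps` is an illustrative accuracy parameter.
function gaussianMeasure(a, b, steps = 100000) {
  const h = (b - a) / steps;
  let total = 0;
  for (let i = 0; i < steps; i++) {
    total += phi(a + (i + 0.5) * h) * h;
  }
  return total;
}

console.log(gaussianMeasure(-1, 1));   // roughly 0.6827
console.log(gaussianMeasure(-10, 10)); // roughly 1, essentially the total mass
```

The same interval \([-1, 1]\) has Lebesgue measure \(2\), so the two measures assign different "sizes" to the same set.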

As it turns out, it is not so obvious how to define what exactly we mean by taking an integral when the set is sufficiently pathological: some sets are just called non-measurable and we can’t hope to meaningfully assign them a measure. Sets of this kind are important to consider when building a rigorous theory about measures but they are not the sort of thing you would stumble upon unless you went out looking for them.

One of the original motivations for measure theory was to provide a framework for excluding these pathological sets and rigorously defining integrals over the other, nicer sets. In general, the domain of a measure is not all subsets of \(\cX\) (called the power set and notated \(2^{\cX}\)), but rather a collection of “nice enough” subsets \(\cF \subseteq 2^{\cX}\).

Formally, the collection \(\cF\) must be a \(\sigma\)-field, meaning that it satisfies certain closure properties. We say \(\cF\) is a \(\sigma\)-field (or \(\sigma\)-algebra) if

The full set \(\cX\) is in \(\cF\).
If \(A\) is in \(\cF\) then its complement \(\cX \setminus A\) is also in \(\cF\) (i.e., \(\cF\) is closed under complementation)
If \(A_1,A_2,\ldots \in \cF\) then \(\bigcup_{i=1}^\infty A_i\) is also in \(\cF\) (i.e. \(\cF\) is closed under countable unions)

The details of this definition are not important for purposes of this course.

Example: If \(\cX\) is countable we can take \(\cF\) to be the entire power set.
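On a finite set the closure properties can be checked by brute force. The sketch below assumes \(\cX = \{0, 1, \ldots, m-1\}\) and encodes each subset as a bitmask; because the set is finite, closure under countable unions reduces to closure under pairwise unions. The helper name `isSigmaField` is hypothetical.

```javascript
// Subsets of X = {0, 1, ..., m-1} are encoded as bitmasks: bit i is set when i is in the subset.
function isSigmaField(collection, m) {
  const full = (1 << m) - 1;                  // bitmask of the full set X
  const sets = new Set(collection);
  if (!sets.has(full)) return false;          // X itself must belong to F
  for (const a of sets) {
    if (!sets.has(full & ~a)) return false;   // closed under complementation
    for (const b of sets) {
      if (!sets.has(a | b)) return false;     // closed under (pairwise) unions
    }
  }
  return true;
}

const sizeOfX = 3;
const powerSet = Array.from({ length: 1 << sizeOfX }, (_, mask) => mask);
console.log(isSigmaField(powerSet, sizeOfX));              // true: the power set
console.log(isSigmaField([0b000, 0b111], sizeOfX));        // true: the trivial sigma-field
console.log(isSigmaField([0b000, 0b001, 0b111], sizeOfX)); // false: complement of {0} is missing
```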

Example: If \(\cX = \mathbb{R}^n\) we will typically use the Borel \(\sigma\)-field \(\cB\), defined as the smallest \(\sigma\)-field that includes all open rectangles \((a_1,b_1)\times (a_2,b_2) \times \cdots \times (a_n, b_n)\), where \(a_i < b_i\) for all \(i\). That is, we start with the open rectangles and recursively apply the closure properties to obtain a very large collection of sets, which informally we can think of as containing all non-pathological subsets of \(\mathbb{R}^n\).

We are now ready to define a measure. We call a pair of a set \(\cX\) and an associated \(\sigma\)-field \(\cF \subseteq 2^{\cX}\) a measurable space.

Definition: Given a measurable space \((\cX, \cF)\), a measure is a function \(\mu: \cF \to [0,\infty]\) (inclusive of \(+\infty\)) satisfying three properties:

Non-negativity: \(\mu(A) \geq 0\) for all \(A\in \cF\).
Countable additivity: If \(A_1,A_2,\ldots\in \cF\) are all disjoint, then

\[ \mu\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty \mu(A_i) \]

Empty set maps to zero: \(\mu(\emptyset) = 0\)²

If \(\mu\) is a measure on \((\cX, \cF)\) we call \((\cX, \cF, \mu)\) a measure space. In the special case \(\mu(\cX) = 1\), we call \(\mu\) a probability measure and \((\cX, \cF, \mu)\) is called a probability space.
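As a quick worked consequence of the definition (a sketch added for illustration): for any \(A \in \cF\), the sets \(A\) and \(\cX \setminus A\) are disjoint with union \(\cX\). Taking \(A_1 = A\), \(A_2 = \cX \setminus A\), and \(A_i = \emptyset\) for \(i \geq 3\) in the countable additivity property gives

\[ \mu(\cX) = \mu(A) + \mu(\cX \setminus A) + \sum_{i=3}^\infty \mu(\emptyset) = \mu(A) + \mu(\cX \setminus A). \]

For a probability measure \(P\) this is the familiar complement rule \(P(\cX \setminus A) = 1 - P(A)\).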

Integrals

One very nice thing about measures is that they let us define integrals of (nice enough) real-valued functions on \(\cX\) with respect to the measure \(\mu\), meaning the integral is “weighted” in a way that assigns total weight \(\mu(A)\) to each set \(A\). We will use the notation \(\int f(x)\,\,d\mu(x)\), or just \(\int f \,d\mu\).

To construct this integral, we begin by defining it for indicator functions, and then extend to more general functions by linearity and limits, in a few steps:

First, for an indicator function \(1_A(x) = 1\{x \in A\}\) of a set \(A \in \cF\), it is straightforward to define the integral as \(\int 1_A \,d\mu = \mu(A)\). (Note that if \(A \notin \cF\) this does not work, but we are only defining the integral for a class of “nice” functions determined by our \(\sigma\)-field.)

Next, consider a simple function \(f(x) = \sum_{i=1}^\infty c_i 1_{A_i}(x)\), with all \(c_i \geq 0\) and \(A_i \in \cF\). Because the integral should be linear, we should have

\[ \int f\,d\mu = \sum_{i=1}^\infty c_i \int 1_{A_i}\,d\mu = \sum_{i=1}^\infty c_i \mu(A_i) \]
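The formula for simple functions translates directly into code. The sketch below assumes a finite sum and represents the measure as a function that returns \(\mu(A)\) for an encoded set \(A\); the encodings (intervals as pairs, finite sets as arrays) and all names are illustrative choices.

```javascript
// Integral of a simple function sum_i c_i * 1_{A_i} with respect to a measure mu.
// Each term is {c, A}; `mu` is any function returning the measure of the encoded set A.
function integrateSimple(terms, mu) {
  return terms.reduce((total, { c, A }) => total + c * mu(A), 0);
}

// Lebesgue measure of an interval [a, b), encoded as the pair [a, b].
const lebesgueOfInterval = ([a, b]) => Math.max(b - a, 0);
// Counting measure of a finite set, encoded as an array of distinct points.
const countingOfFiniteSet = (points) => points.length;

const stepFunction = [
  { c: 2, A: [0, 1] },   // value 2 on the interval [0, 1)
  { c: 5, A: [1, 1.5] }, // value 5 on the interval [1, 1.5)
];
console.log(integrateSimple(stepFunction, lebesgueOfInterval)); // 2 * 1 + 5 * 0.5 = 4.5

const onIntegers = [
  { c: 2, A: [0, 1, 2] }, // value 2 on the set {0, 1, 2}
  { c: 5, A: [7] },       // value 5 on the set {7}
];
console.log(integrateSimple(onIntegers, countingOfFiniteSet)); // 2 * 3 + 5 * 1 = 11
```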

Third, we can extend to all sufficiently nice non-negative functions \(f\) by approximating them from below with an increasing sequence of simple functions \(f_1 \leq f_2 \leq \cdots\) converging to \(f\):

\[ \int f\,d\mu = \lim_{n\to\infty} \int f_n\,d\mu. \]

For example, we could take the sequence of simple functions \[f_n(x) = \sum_{k=0}^{\infty} k 2^{-n} 1_{A_{n,k}}(x),\quad \text{ where } A_{n,k} = \left\{x:\; f(x) \in \left[k 2^{-n}, (k+1) 2^{-n}\right)\right\},\] as illustrated in the picture below.

Plot = require("@observablehq/plot")
d3 = require("d3")

// Define the Gaussian function
function gaussian(x, mu = 0, sigma = 1) {
  return (1 / (sigma * Math.sqrt(2 * Math.PI))) * 
    Math.exp(-0.5 * Math.pow((x - mu) / sigma, 2));
}

// Define our function f(x) = 3 g(x - 4) + 4.8 g(x - 7), where g is the standard Gaussian density
function f(x) {
  return 3 * gaussian(x - 4, 0, 1) + 4.8 * gaussian(x - 7, 0, 1);
}

// Create a slider for c, the refinement level (called n in the text)
viewof c = Inputs.range([0, 4], {step: 1, label: "n"})

// Generate x values
xValues = d3.range(-2, 13, 0.01)

// Calculate y values
data = xValues.map(x => ({x, y: f(x)}))

// Calculate Lebesgue sets with multiple intervals
function calculateLebesgueSets(data, c) {
  const maxY = d3.max(data, d => d.y);
  const sets = d3.range(0, Math.ceil(maxY * Math.pow(2, c))).map(k => {
    const lower = k * Math.pow(2, -c);
    const upper = (k + 1) * Math.pow(2, -c);
    // Half-open level set [lower, upper), matching the definition of A_{n,k} in the text
    const points = data.filter(d => d.y >= lower && d.y < upper);
    
    if (points.length > 0) {
      // Identify distinct intervals
      const intervals = [];
      let currentInterval = [points[0]];
      for (let i = 1; i < points.length; i++) {
        if (points[i].x - points[i-1].x <= 0.011) { // Allow small gaps
          currentInterval.push(points[i]);
        } else {
          intervals.push(currentInterval);
          currentInterval = [points[i]];
        }
      }
      intervals.push(currentInterval);

      return {
        k,
        lower,
        upper,
        intervals: intervals.map(interval => ({
          xMin: d3.min(interval, d => d.x),
          xMax: d3.max(interval, d => d.x),
          points: interval
        }))
      };
    }
    return null;
  }).filter(set => set !== null);
  
  return sets;
}

lebesgueSets = calculateLebesgueSets(data, c)

// Create the plot
plot = Plot.plot({
  marks: [
    Plot.line(data, {x: "x", y: "y", stroke: "gray", strokeWidth: 1}),
    Plot.dot(lebesgueSets.flatMap(set => set.intervals.flatMap(int => int.points)), {
      x: "x", 
      y: "y", 
      fill: d => d3.schemeCategory10[lebesgueSets.findIndex(s => s.intervals.some(int => int.points.includes(d))) % 10],
      r: 2
    }),
    Plot.rectY(lebesgueSets.flatMap(set => 
      set.intervals.map(int => ({
        x1: int.xMin,
        x2: int.xMax,
        y1: 0,
        y2: set.lower,
        color: set.k
      }))
    ), {
      x1: "x1",
      x2: "x2",
      y1: "y1",
      y2: "y2",
      fill: d => d3.schemeCategory10[d.color % 10],
      fillOpacity: 0.2,
      stroke: d => d3.schemeCategory10[d.color % 10],
      strokeOpacity: 0.5
    })
  ],
  y: {
    domain: [0, d3.max(data, d => d.y) * 1.1],
    label: "f(x)"
  },
  x: {
    domain: [-2, 13],
    label: "x"
  },
  style: {
    backgroundColor: "white"
  },
  width: 800,
  height: 500
})
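
As a concrete check of this construction, consider the hypothetical choice \(\cX = [0,1)\) with the Lebesgue measure and \(f(x) = x\) (an illustration that is not the function plotted above). Then \(A_{n,k} = \left[k 2^{-n}, (k+1) 2^{-n}\right)\) for \(k < 2^n\) and \(A_{n,k} = \emptyset\) otherwise, so each nonempty level set has measure \(2^{-n}\) and

\[ \int f_n\,d\lambda = \sum_{k=0}^{2^n-1} k 2^{-n} \cdot 2^{-n} = 2^{-2n} \cdot \frac{2^n (2^n - 1)}{2} = \frac{1 - 2^{-n}}{2}, \]

which increases to \(1/2\) as \(n \to \infty\), agreeing with the calculus value of \(\int_0^1 x \,d x\).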

Finally, we can write any real-valued function as the difference of its positive and negative parts, \(f(x) = f^+(x) - f^-(x)\), where \(f^+(x) = \max\{f(x), 0\}\) and \(f^-(x) = \max\{-f(x), 0\}\). Both \(f^+\) and \(f^-\) are non-negative functions, so each has a non-negative (possibly infinite) integral. We then simply take

\[ \int f\,d\mu = \int f^+\,d\mu - \int f^-\,d\mu \in [-\infty, \infty], \]

calling the difference undefined if the integrals of both \(f^+\) and \(f^-\) are infinite.

As a result, we have \(\int f\,d\mu\) for any function \(f\) whose positive and negative parts can both be approximated from below by simple functions. Note that we have left out some important details in this presentation (for example we have not characterized which functions \(f\) are nice enough to be approximated well by simple functions) but these details are unimportant for this class. The important thing to know is that to any measure \(\mu\) there corresponds a well-defined integral \(\int \cdot \,d\mu\), which behaves as we would expect it to.

We can now return to our previous examples of measures and ask what the corresponding integrals are:

Example 1, continued (Counting measure): An integral with respect to \(\#\) just adds up all the values of \(f(x)\):

\[ \int f\,d\# = \sum_{x\in \cX} f(x) \]
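For instance, taking \(\cX = \ZZ\) and the illustrative choice \(f(x) = 2^{-|x|}\), the integral is an ordinary convergent series:

\[ \int f\,d\# = \sum_{x\in \ZZ} 2^{-|x|} = 1 + 2\sum_{k=1}^\infty 2^{-k} = 1 + 2 = 3. \]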

Example 2, continued (Lebesgue measure): An integral with respect to the Lebesgue measure is called a Lebesgue integral, which is essentially just the usual integral you are used to from calculus class:

\[ \int f\,d\lambda = \int\cdots \int f(x) \,d x_1 \cdots \,d x_n. \]

The Lebesgue integral extends the Riemann integral to a more general class of functions, in the sense that if the Riemann integral of \(f\) is defined then the Lebesgue integral is also well defined and the two integrals coincide. But the Lebesgue integral is also well-defined for functions like \(f(x) = 1\{x \in \mathbb{Q}\}\), for which the Riemann integral is not well-defined (Exercise: what is the Lebesgue integral of \(1\{x \in \mathbb{Q}\}\)?)

Example 3, continued (Gaussian measure): Note that \(P_Z(A)\) is defined as the (Lebesgue) integral of \(1_A(x)\phi(x)\). By extension, the integral of \(f\) with respect to \(P_Z\) is the Lebesgue integral of \(f(x) \phi(x)\), which is nothing more than the expectation of \(f(Z)\):

\[ \int f\,d P_Z = \int_{-\infty}^\infty f(x) \phi(x) \,d x = \EE[f(Z)]. \]
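The identity above can be checked numerically. This is a rough sketch that approximates \(\int f(x)\phi(x)\,dx\) with the midpoint rule on a truncated range; the truncation to \([-10, 10]\), the step count, and the function names are assumptions made for illustration.

```javascript
// Standard Gaussian density.
function standardNormalDensity(x) {
  return Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
}

// Approximate E[f(Z)] as the Lebesgue integral of f(x) * phi(x), using the midpoint rule.
// The tails beyond [lower, upper] are ignored, which is harmless for slowly growing f.
function expectationUnderGaussian(f, { lower = -10, upper = 10, steps = 200000 } = {}) {
  const h = (upper - lower) / steps;
  let total = 0;
  for (let i = 0; i < steps; i++) {
    const x = lower + (i + 0.5) * h;
    total += f(x) * standardNormalDensity(x) * h;
  }
  return total;
}

console.log(expectationUnderGaussian(x => x * x));           // roughly 1, the variance of Z
console.log(expectationUnderGaussian(x => (x > 0 ? 1 : 0))); // roughly 0.5, i.e. P(Z > 0)
```

The second call integrates an indicator function, so it recovers the measure \(P_Z((0, \infty))\) as a special case of the integral.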

Densities

We have just seen in the last two examples that there is a special relationship between the Lebesgue measure \(\lambda\) on \(\RR\) and the Gaussian measure \(P_Z\), allowing us to evaluate integrals with respect to \(P_Z\) by turning them into integrals with respect to \(\lambda\), namely \(\int f(x)\,d P_Z(x) = \int f(x)\phi(x) \,d\lambda(x)\).

This is a happy fact: mathematicians have gone to a lot of trouble figuring out how to calculate Lebesgue integrals of various functions, and it’s nice not to have to reinvent the wheel!

Note that we can’t turn integrals for every random variable into Lebesgue integrals. If \(Y\) follows a binomial distribution, for example, we can just as well define a probability measure \(P_Y(A) = \PP(Y \in A)\), but there is no counterpart to \(\phi\) that would let us turn \(P_Y\) integrals into Lebesgue integrals in the same way.

Formally, consider a measurable space \((\cX, \cF)\), with two measures \(P\) and \(\mu\). We say \(P\) is absolutely continuous with respect to \(\mu\) if \(P(A) = 0\) whenever \(\mu(A) = 0\). In notation, we write \(P \ll \mu\).
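To see how this definition handles the binomial example, suppose \(Y\) takes values in \(\{0, 1, \ldots, n\}\). A finite set of points has zero volume, so

\[ \lambda(\{0,1,\ldots,n\}) = 0 \quad \text{ while } \quad P_Y(\{0,1,\ldots,n\}) = 1, \]

and therefore \(P_Y\) is not absolutely continuous with respect to \(\lambda\). On the other hand, viewing \(P_Y\) as a measure on \(\ZZ\), the only set with \(\#(A) = 0\) is \(A = \emptyset\), and \(P_Y(\emptyset) = 0\), so \(P_Y \ll \#\).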

If \(P \ll \mu\) then, under mild conditions, we can always define a density function \(p:\cX \to [0,\infty)\) such that

\[ P(A) = \int 1_A(x) p(x)\,d\mu(x), \quad \text{ for all } A \in \cF, \]

and by extension \(\int f(x)\,d P(x) = \int f(x) p(x) \,d\mu(x)\).

The function \(p\) is called the density function or Radon-Nikodym derivative of \(P\) with respect to \(\mu\). It is sometimes written using the suggestive notation \(\frac{\,d P}{\,d\mu}(x)\). Whenever we have a density function we can turn integrals with respect to \(P\) into integrals with respect to \(\mu\) simply by multiplying the density \(p\) into the integrand.

If we do not specify what \(\mu\) is, it is assumed to be the Lebesgue measure; that is, if we say \(P\) is absolutely continuous with no further elaboration, we mean \(P \ll \lambda\).

If \(P\) is a probability measure and \(\mu\) is the Lebesgue measure, then \(p(x)\) is a probability density function (pdf). If \(\mu\) is a counting measure, we call \(p(x)\) the probability mass function (pmf).
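As an illustration of a density with respect to the counting measure, the sketch below treats the binomial pmf as \(p = dP_Y/d\#\) and computes \(\int f\,dP_Y = \sum_y f(y)p(y)\). The parameter names `n` and `theta` and the helper names are assumptions for this example.

```javascript
// Binomial coefficient C(n, k), built up multiplicatively so each intermediate value is an integer.
function choose(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Binomial(n, theta) pmf: the density of P_Y with respect to counting measure.
function binomialPmf(y, n, theta) {
  return choose(n, y) * Math.pow(theta, y) * Math.pow(1 - theta, n - y);
}

// Integral of f with respect to P_Y, computed as the counting-measure integral of f * p,
// which is a finite sum over {0, ..., n}.
function integrateAgainstBinomial(f, n, theta) {
  let total = 0;
  for (let y = 0; y <= n; y++) {
    total += f(y) * binomialPmf(y, n, theta);
  }
  return total;
}

console.log(integrateAgainstBinomial(() => 1, 10, 0.3)); // roughly 1, the total mass
console.log(integrateAgainstBinomial(y => y, 10, 0.3));  // roughly 3, the mean n * theta
```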

Probability spaces and random variables

In a typical statistics problem we may have multiple random variables, some continuous and some discrete, and some may be functions of others. We want to have good notation for the probability that “something happens,” like \(\PP(X^2 < (Y+Z)/ W)\), or the expectation of a random variable, like \(\EE[(XY-ZW)^2]\). What does this kind of expression have to do with the measures and integrals we have been discussing so far?

If we defined a probability measure \(P\) as the joint distribution of \((X, Y, Z, W)\) then we could evaluate \(P\) on the corresponding subset of the sample space, e.g. \(P(\{(x,y,z,w) :\; x^2 < (y+z)/w\})\). Or we could write an appropriate integral like \(\int (xy-zw)^2\,dP(x,y,z,w)\). But it is convenient not to have to make things so explicit. To this end, we can introduce the idea of an abstract outcome \(\omega\) in an outcome space \(\Omega\), which informally represents “all of the information required to evaluate every random variable in the problem.” Then a random variable is simply any function of \(\omega\), and we can think of \(\PP\) as a measure on \(\Omega\) and \(\EE\) as an integral with respect to \(\PP\). Our usual way of writing probabilities and expectations then becomes shorthand for the corresponding measure and integral over \(\Omega\); e.g. \[\PP(X^2 < (Y+Z)/W) = \PP\left(\{\omega \in \Omega:\; X(\omega)^2 < (Y(\omega) + Z(\omega))/W(\omega)\}\right),\] and \[\EE[(XY-ZW)^2] = \int_\Omega (X(\omega)Y(\omega) - Z(\omega)W(\omega))^2 \,d\PP(\omega),\] and it is understood that we will never bother to make explicit what kind of object \(\omega\) is. A subset of \(\Omega\) to which \(\PP\) assigns a probability is called an event and any function of \(\omega\) is a random variable. If \(\PP(A) = 1\), we say \(A\) occurs almost surely.

To do actual calculations of probabilities and expectations we use the distributions of the random variables involved in that particular calculation. The distribution of a random variable \(X(\omega)\) is given by the push-forward measure defined as \(Q = \PP \circ X^{-1}\). That is, if \(X\) has realizations in the sample space \(\cX\), and \(B\) is a (measurable) subset of \(\cX\), then \[Q(B) = \PP(X^{-1}(B)) = \PP(\{\omega \in \Omega:\; X(\omega) \in B \}).\]
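A small finite example may help fix ideas. The sketch below assumes \(\Omega\) is the set of ordered outcomes of two fair dice and \(X(\omega)\) is their sum, and computes the push-forward \(Q(B)\) by summing \(\PP\) over the preimage \(X^{-1}(B)\). All names are hypothetical.

```javascript
// Outcome space: ordered pairs from two fair dice, each outcome with probability 1/36.
const twoDiceOutcomes = [];
for (let first = 1; first <= 6; first++) {
  for (let second = 1; second <= 6; second++) {
    twoDiceOutcomes.push({ omega: [first, second], prob: 1 / 36 });
  }
}

// Push-forward Q(B) = P({omega : X(omega) in B}), with B given as a predicate on values of X.
function pushForward(outcomeList, X, B) {
  return outcomeList
    .filter(({ omega }) => B(X(omega)))
    .reduce((total, { prob }) => total + prob, 0);
}

// The random variable X(omega): the sum of the two dice.
const diceSum = ([first, second]) => first + second;

console.log(pushForward(twoDiceOutcomes, diceSum, v => v === 7)); // about 6/36
console.log(pushForward(twoDiceOutcomes, diceSum, v => v >= 11)); // about 3/36
```

Once \(Q\) is known, questions about \(X\) can be answered without referring back to \(\Omega\) at all.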

Conditional probability

While it is beyond the scope of this course, measure theory also allows us to patch the definition of conditional probability and conditional expectation. Given two events \(A\) and \(B\), if \(\PP(B) > 0\), we can unproblematically define the conditional probability of \(A\) given \(B\) as \(\PP(A \mid B) = \PP(A \cap B) / \PP(B)\), but this definition obviously fails when \(\PP(B) = 0\).
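As a simple instance of the elementary definition, take a fair die with \(A = \{4\}\) and \(B = \{2, 4, 6\}\) (an illustrative choice). Since \(A \cap B = A\),

\[ \PP(A \mid B) = \frac{\PP(A \cap B)}{\PP(B)} = \frac{1/6}{1/2} = \frac{1}{3}. \]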

Generally speaking, we cannot necessarily define \(\PP(A \mid B)\) for measure zero events \(B\) (Homework 0 includes a problem illustrating the inherent ambiguity of this definition). But, for example, if \(X\) and \(Y\) are both continuous random variables with some dependence between them we would like to be able to discuss, e.g., the distribution or expectation of \(Y\) given that \(X\) takes on some specific value \(x\). We can do this by defining the conditional expectation \(\EE(Y \mid X)\) as a random variable \(g(X)\), which has the property \(\EE[(Y - g(X)) 1_A(X)] = 0\) for all (nice) subsets \(A\). By evaluating this function \(g\) at \(x\) we can answer the question we asked earlier. However, note this explanation is informal and brushes many important points under the rug; for a more complete explanation, take Stat 205A.

Having defined the conditional expectation, we can also ask about the conditional distribution of \(Y\) by evaluating the conditional expectation on new random variables defined with indicator functions: \(\PP(Y \in A \mid X) = \EE[1_A(Y) \mid X]\).
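In the discrete case the defining property of the conditional expectation can be verified directly. The sketch below uses a made-up joint pmf for \((X, Y)\), computes \(g(x) = \EE[Y \mid X = x]\), and checks that \(\EE[(Y - g(X)) 1_A(X)]\) vanishes; the numbers and function names are purely illustrative.

```javascript
// Hypothetical joint pmf of (X, Y) on a 2 x 2 grid; the probabilities are illustrative only.
const jointPmf = [
  { x: 0, y: 0, p: 0.1 }, { x: 0, y: 1, p: 0.3 },
  { x: 1, y: 0, p: 0.4 }, { x: 1, y: 1, p: 0.2 },
];

// g(x) = E[Y | X = x] = sum_y y p(x, y) / sum_y p(x, y), defined when P(X = x) > 0.
function conditionalMean(pmf, x) {
  const rows = pmf.filter(r => r.x === x);
  const mass = rows.reduce((total, r) => total + r.p, 0);
  return rows.reduce((total, r) => total + r.y * r.p, 0) / mass;
}

// E[(Y - g(X)) 1_A(X)] for a set A given as a predicate on x.
function residualMoment(pmf, inA) {
  return pmf.reduce(
    (total, r) => total + (r.y - conditionalMean(pmf, r.x)) * (inA(r.x) ? 1 : 0) * r.p,
    0
  );
}

console.log(conditionalMean(jointPmf, 0));           // about 0.75
console.log(conditionalMean(jointPmf, 1));           // about 0.333
console.log(residualMoment(jointPmf, x => x === 0)); // 0 up to rounding error
console.log(residualMoment(jointPmf, () => true));   // 0 up to rounding error
```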

Footnotes

¹ A problem in the first homework will guide you step-by-step in constructing the kind of pathological set that makes the parenthetical caveat necessary when we are dealing with continuous spaces.
² The requirement that \(\mu(\emptyset) = 0\) is redundant if \(\mu(\cX)\) is finite. Otherwise, it prevents us from assigning infinite measure to every set including \(\emptyset\).