p-Values, Confidence Regions, and (Mis-)Interpreting Tests

\[ \newcommand{\cB}{\mathcal{B}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cP}{\mathcal{P}} \newcommand{\cX}{\mathcal{X}} \newcommand{\EE}{\mathbb{E}} \newcommand{\PP}{\mathbb{P}} \newcommand{\RR}{\mathbb{R}} \newcommand{\ZZ}{\mathbb{Z}} \newcommand{\td}{\,\textrm{d}} \newcommand{\simiid}{\stackrel{\textrm{i.i.d.}}{\sim}} \newcommand{\simind}{\stackrel{\textrm{ind.}}{\sim}} \newcommand{\eqas}{\stackrel{\textrm{a.s.}}{=}} \newcommand{\eqPas}{\stackrel{\cP\textrm{-a.s.}}{=}} \newcommand{\eqmuas}{\stackrel{\mu\textrm{-a.s.}}{=}} \newcommand{\eqD}{\stackrel{D}{=}} \newcommand{\indep}{\perp\!\!\!\!\perp} \DeclareMathOperator*{\minz}{minimize\;} \DeclareMathOperator*{\maxz}{maximize\;} \DeclareMathOperator*{\argmin}{argmin\;} \DeclareMathOperator*{\argmax}{argmax\;} \newcommand{\Var}{\textnormal{Var}} \newcommand{\Cov}{\textnormal{Cov}} \newcommand{\Corr}{\textnormal{Corr}} \newcommand{\ep}{\varepsilon} \]

As we have previously defined hypothesis tests, they are characterized by dichotomous accept/reject decisions after choosing a null hypothesis, a test statistic, and a critical threshold. Sometimes we really do need to make a dichotomous decision (for example, the FDA really has to decide whether to approve a drug or not), but this is rare in practice. If our test statistic is large enough to reject \(H_0:\;\theta=0\) at the \(\alpha = 0.05\) level, we would usually still be interested in questions like:

  • Would we also have rejected at a stricter level, such as \(\alpha = 0.01\) or \(\alpha = 0.001\)?

  • Would we also have rejected a different null hypothesis, such as \(H_0:\;\theta = 0.1\)?

These questions can be answered by \(p\)-values and confidence regions, which enrich our dichotomous decision by respectively telling us about the outcome for other \(\alpha\) values we could have used, and for other null hypotheses we could have tested.

p-Values

Informal definition

The \(p\)-value \(p(X)\) is a measure of whether our data set would have led us to reject the null at various different \(\alpha\) values. If we are rejecting for large values of a test statistic \(T(X)\) then this boils down to asking how extreme \(T(X)\) is relative to its null distribution, leading to the familiar informal definition of the \(p\)-value:

Definition (Informal): The \(p\)-value is the probability for a test statistic \(T(X)\) to be at least as large as its realized value, under the assumption that the null is true. That is, for a fixed value \(x\in\cX\), the \(p\)-value \(p(x)\) should be \(\PP_{H_0}(T(X)\geq T(x))\), or more precisely \[ p(x) = \sup_{\theta\in\Theta_0} \PP_{\theta}(T(X) \geq T(x)), \] allowing for the possibility of a composite null. Then the random variable \(p(X)\) is the \(p\)-value.

Example: Binomial If \(X\sim \text{Binom}(n,\theta)\) and we want to test \(H_0:\;\theta\leq 0.5\) vs \(H_1:\;\theta > 0.5\), the UMP test rejects for large values of \(X\). Thus, the \(p\)-value is \[ p(x) = \sup_{\theta\leq 0.5} \PP_\theta(X\geq x) = \PP_{0.5}(X\geq x), \] since \(X\) is stochastically increasing in \(\theta\) and the probability is therefore maximized at the boundary.
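For concreteness, here is a minimal Python sketch of this computation; the sample size \(n\) and observed count \(x\) below are made-up values.

```python
from scipy import stats

# Hypothetical example values: n trials, observed count x.
n, x = 20, 15

# One-sided p-value for H0: theta <= 0.5 vs H1: theta > 0.5:
# p(x) = P_{0.5}(X >= x); binom.sf(x - 1, ...) returns P(X > x - 1) = P(X >= x).
p_value = stats.binom.sf(x - 1, n, 0.5)
print(p_value)  # about 0.021 for n = 20, x = 15
```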

Example: \(Z\)-test If \(X\sim N(\theta,1)\) and we are testing \(H_0:\;\theta = 0\) vs \(H_1:\;\theta \neq 0\), the two-sided test rejects for large \(T(X)=|X|\). The two-sided \(p\)-value is therefore \[ p(x) = \PP_0(|X|>|x|) = 2(1-\Phi(|x|)). \]
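The analogous sketch for the two-sided \(Z\)-test, again with a made-up observation \(x\):

```python
from scipy import stats

x = 2.1  # hypothetical observed value of X ~ N(theta, 1)

# Two-sided p-value: P_0(|X| > |x|) = 2 * (1 - Phi(|x|)).
p_value = 2 * stats.norm.sf(abs(x))
print(p_value)  # about 0.036 for x = 2.1
```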

Formal definition

Not all tests are easily characterized as rejecting when some \(T(X)\) is above a threshold; for example, a two-sided UMPU test rejects when some \(T(X)\) is either large or small. Thus it is useful to have a more general definition:

Assume we are testing \(H_0:\;\theta\in\Theta_0\) vs \(H_1:\;\theta\in\Theta_1\) in a model \(\cP\) based on data \(X\), and that we have a test \(\phi_\alpha\) for every significance level \(\alpha \in [0,1]\): \[ \sup_{\theta\in\Theta_0} \EE_\theta \phi_\alpha(X) \leq \alpha. \] Assume further that \(\phi_{\alpha}\) is non-decreasing in \(\alpha\) (when the test rejects at a smaller/stricter \(\alpha\), it also rejects at any larger/more lenient \(\alpha\)): \[ \phi_{\alpha_1}(X) \leq \phi_{\alpha_2}(X) \quad \text{ if } \alpha_1\leq \alpha_2. \]

Definition (Formal): Then, we can define the \(p\)-value with respect to this family of tests as the value of \(\alpha\) for which the test barely rejects: \[ p(x) = \sup \{\alpha:\; \phi_\alpha(x) < 1\} = \inf \{\alpha:\; \phi_\alpha(x) = 1\}, \] and in terms of the rejection regions: \[ p(x) = \sup \{\alpha:\; x \notin R_\alpha\} = \inf\{\alpha:\; x \in R_\alpha\}. \]

Example: Exponential Suppose that we are testing \(H_0:\;\theta=1\) vs \(H_1:\;\theta\neq 1\) in the model \(X \sim \text{Exp}(\theta)\). We can use either the equal-tailed test, or the UMPU test. Consider a value \(x>1\), which will be in the acceptance region (for sufficiently small \(\alpha\)) or the right lobe of the rejection region (for sufficiently large \(\alpha\)). For either test, the acceptance region’s right boundary decreases continuously with \(\alpha\), so the \(p\)-value is the unique value of \(\alpha\) for which \(x\) is on the boundary. For the equal-tailed test, we have at that \(\alpha\) value \[ \alpha/2 = \PP_1(X>x) = e^{-x}, \] so \(p(x) = 2e^{-x}\). For the UMPU test \(p(x)\) is defined implicitly as the value of \(\alpha\) for which \(c_2(\alpha) = x\), which we can solve for numerically.
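Here is a short sketch of the equal-tailed \(p\)-value, written to cover both lobes (for \(x>1\) it reduces to \(2e^{-x}\) as above); the observed value \(x\) is made up, and the UMPU version would instead require solving for \(c_2(\alpha)\) numerically, along the lines of the root-finding sketch in the confidence interval section below.

```python
import numpy as np

x = 2.5  # hypothetical observed value of X ~ Exp(theta), testing theta = 1

# Equal-tailed p-value: the alpha at which x lands on the boundary of the
# acceptance region, i.e. twice the smaller of the two tail probabilities
# under theta = 1. For x in the right lobe this is 2 * exp(-x).
p_equal_tailed = 2 * min(np.exp(-x), 1 - np.exp(-x))
print(p_equal_tailed)  # 2 * exp(-2.5), about 0.16
```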

This formal definition reduces to our informal definition if the test \(\phi_\alpha\) rejects for large \(T(X)\) and the critical threshold is tight:

Proposition: Assume that for each \(\alpha\), we reject for large \(T(X)\), taking the threshold \(c_\alpha\) as small as possible while achieving Type I error control:[^1] \[ c_\alpha = \min \left\{c:\; \PP_\theta(T(X) > c) \leq \alpha, \text{ for all } \theta\in\Theta_0 \right\}, \] noting that the minimum is well-defined because (complementary) CDFs are right-continuous.

At the boundary, we either

  • (non-randomized \(\phi\)) reject if \(\PP_\theta(T(X) \geq c_\alpha) \leq \alpha\) for all \(\theta\in\Theta_0\), or

  • (randomized \(\phi\)) reject with probability \[ \gamma_\alpha = \max\left\{ \gamma:\; \PP_\theta(T > c_\alpha) + \gamma\PP_\theta(T = c_\alpha) \leq \alpha, \forall \theta\in\Theta_0\right\} \]

Then the two definitions of \(p(x)\) coincide.

Proof: In the non-randomized case, define \(\gamma_\alpha = 1\) if we reject at the boundary and \(0\) otherwise.

Let \(p_1(x) = \sup_{\theta\in\Theta_0} \PP_\theta(T(X)\geq T(x))\), and \(p_2(x) = \sup\{\alpha:\; \phi_\alpha(x) < 1\}\). We have \[ \begin{aligned} p_1(x) > \alpha &\iff \PP_\theta(T(X) \geq T(x)) > \alpha, \text{ for some } \theta\in\Theta_0\\ &\iff c_\alpha > T(x), \text{ or } c_\alpha = T(x) \text{ and } \gamma_\alpha < 1\\ &\iff \phi_\alpha(x) < 1. \end{aligned} \] But then \[ p_2(x) = \sup\{\alpha:\; p_1(x) > \alpha\} = p_1(x), \] as desired.\(\blacksquare\)

Super-uniformity

The \(p\)-value derived from any valid family of tests \(\phi_\alpha\) is super-uniform under the null, meaning it is stochastically larger than uniform: \[ \PP_\theta( p(X) \leq \alpha ) \leq \alpha, \text{ for all } \theta\in\Theta_0. \] Note that \(p(x) \leq \alpha\) if and only if \(\phi_{\alpha+\ep}(x) = 1\) for all \(\ep>0\). Thus, for \(\theta \in \Theta_0\), we have \[ \begin{aligned} \PP_\theta(p(X) \leq \alpha) &= \PP_\theta\left( \phi_{\alpha+\ep}(X) = 1, \text{ for all } \ep>0 \right)\\ &= \lim_{\ep \downarrow 0} \PP_\theta\left(\phi_{\alpha+\ep}(X) = 1\right)\\ &\leq \lim_{\ep \downarrow 0} \EE_\theta \left[ \phi_{\alpha+\ep}(X)\right]\\ &\leq \alpha, \end{aligned} \] where the last inequality holds because \(\EE_\theta[\phi_{\alpha+\ep}(X)] \leq \alpha+\ep\) for every \(\ep>0\).
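As a quick sanity check, here is a simulation sketch verifying super-uniformity for the one-sided binomial \(p\)-value from the earlier example; the null parameter value, level, and simulation size are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, theta_null, alpha, reps = 20, 0.4, 0.05, 100_000  # theta_null lies inside H0: theta <= 0.5

# Simulate data under the null and compute the one-sided p-value P_{0.5}(X >= x).
X = rng.binomial(n, theta_null, size=reps)
p_values = stats.binom.sf(X - 1, n, 0.5)

# Super-uniformity: the rejection rate P_theta(p(X) <= alpha) should be at most alpha.
print((p_values <= alpha).mean())  # typically well below 0.05 at theta = 0.4
```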

Interpreting the \(p\)-value

One important thing to remember when we interpret the \(p\)-value is that it depends on which statistical test we choose (as well as the data, the model, and the null hypothesis). When the null and/or alternative hypotheses are composite, there may be a range of different but justifiable choices of test. In that case, it would be a mistake to think of the \(p\)-value for any one of those tests as the canonical summary of the evidence in the data against the null.

Example: (Multivariate Gaussian) Suppose we observe \(X \sim N_d(\mu, I_d)\) and wish to test the point null \(H_0: \mu = 0\) against the composite alternative \(H_1: \mu \neq 0\). For \(d = 1\), the alternative is bi-directional, but most analysts will agree on the standard two-sided test. By contrast, for \(d\geq 2\), the alternative is multidirectional, so there are different tests we could choose depending on our beliefs about which alternatives are more likely than others; the higher the dimension of the problem, the higher the stakes of this choice.

For example, if we want our test to be invariant to the direction \(\frac{\mu}{\|\mu\|}\), we should reject for large values of the two-norm \(\|X\|_2\). But suppose instead we expect \(\mu\) to be sparse if it is nonzero; then \(\|X\|_\infty = \max_{i=1}^d |X_i|\) might be a much better choice. The first test is called the \(\chi^2\) test, because \(\|X\|_2^2\) has a \(\chi_d^2\) distribution under the null, and the second is called the max test; each dominates the other in a different sparsity regime.

The widget below shows the power curves as a function of \(\theta\) when \(\mu\) is a \(k\)-sparse vector with equal nonzero entries and norm \(\|\mu\|_2=\theta\): \[ \mu = \theta \cdot \frac{1}{\sqrt{k}} \binom{1_k}{0_{d-k}}, \] where \(1_n\) and \(0_n\) are respectively the all-ones and all-zeros vectors in \(\RR^n\). By playing with \(d\) and \(k\), you can see that the max test outperforms the \(\chi^2\) test when \(\mu\) is sufficiently sparse, but the reverse is true when \(\mu\) is dense; the differences become more pronounced as \(d\) grows larger.
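For readers without access to the interactive widget, the following Monte Carlo sketch compares the two tests at a single configuration; the dimension \(d\), sparsity \(k\), signal strength \(\theta\), and simulation size are all made-up values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, k, theta, alpha, reps = 100, 2, 4.0, 0.05, 20_000

# k-sparse mean vector with equal nonzero entries and ||mu||_2 = theta.
mu = np.zeros(d)
mu[:k] = theta / np.sqrt(k)

X = rng.standard_normal((reps, d)) + mu

# Chi-square test: reject when ||X||_2^2 exceeds the upper-alpha chi-square quantile.
chi2_cutoff = stats.chi2.ppf(1 - alpha, df=d)
power_chi2 = ((X ** 2).sum(axis=1) > chi2_cutoff).mean()

# Max test: reject when max_i |X_i| exceeds c with P_0(max_i |Z_i| > c) = alpha,
# i.e. c = Phi^{-1}((1 + (1 - alpha)^(1/d)) / 2).
max_cutoff = stats.norm.ppf((1 + (1 - alpha) ** (1 / d)) / 2)
power_max = (np.abs(X).max(axis=1) > max_cutoff).mean()

print(power_chi2, power_max)  # the max test tends to win at this sparse configuration
```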

Thus, depending on what test we use on the same data set, we can get very different \(p\)-values.

Confidence Regions

Suppose we are testing \(H_0:\;\theta = 0\) vs \(H_1:\; \theta \neq 0\). When the \(p\)-value is very small, it gives us a strong indication that the data set we observed is inconsistent with the parameter value \(\theta=0\). It does not necessarily follow that the data indicate that \(\theta\) is far from zero: for example, with a huge data set we might be able to say with very high confidence that \(\theta\) is in the range \([0.0011,0.0012]\). Depending on the context, this could practically amount to confirming the informal scientific null that \(\theta\) is too small to care about, even though the \(p\)-value tells us (accurately) that the formal statistical null \(\theta=0\) is highly implausible.

By the same token, we could also be mistaken if we observe that the \(p\)-value is large and conclude that \(\theta\) must be close to zero. It could be the case that the data set establishes that \(\theta\) is in a narrow range around zero, or it could simply be that the data give very poor evidence about \(\theta\) so we still don’t know very much about it.

Confidence intervals, and more generally confidence regions, are a more reliable guide than \(p\)-values if we want to know what range of \(\theta\) values is plausible in light of the data. As we will see, they can be obtained using the same machinery we have developed for hypothesis tests. Whereas the \(p\)-value tells us what our test would decide for all values of \(\alpha\), we can think of a confidence region as telling us what our test of the point null \(H_0:\; \theta = \theta_0\) vs \(H_1:\; \theta\neq \theta_0\) would decide for all values of \(\theta_0\).

Definition of a confidence region

Definition: We say that \(C(X)\) is a \(1-\alpha\) confidence region for \(g(\theta)\) if:

\[ \PP_\theta(C(X) \ni g(\theta)) \geq 1-\alpha, \quad \text{for all } \theta \in \Theta \] We say that \(C(x)\) covers \(g(\theta)\) if \(C(x) \ni g(\theta)\), and the coverage probability at a parameter value \(\theta\) is \(\PP_\theta(C(X) \ni g(\theta))\). The confidence level of \(C(X)\) is \(\inf_{\theta\in\Theta} \mathbb{P}_\theta(C(X) \ni g(\theta))\).

The use of the “\(\ni\)” symbol in the above definition is deliberate. To avoid common misinterpretations, we should always think and speak of the interval or region \(C(X)\) as the random “subject” of the mathematical sentence and the estimand \(g(\theta)\) as the fixed “object.”

The confidence interval is commonly misinterpreted as a Bayesian guarantee, where \(C(X)\) is first realized and then \(g(\theta)\) has some probability of falling into \(C(X)\). This interpretation is incorrect: confidence intervals are frequentist objects whose guarantee is intended to apply for fixed \(\theta\) values. Once \(X\) is realized \(C(X)\) either does or does not contain the estimand \(g(\theta)\); there is no remaining randomness in the problem.

Thus, while the following formulations are technically mathematically equivalent, the second one tends to give people the wrong impression and is not recommended:

  • Formulation 1 (recommended): \(C(X)\) has a \(95\%\) chance of covering \(g(\theta)\)

  • Formulation 2 (not recommended): \(g(\theta)\) has a \(95\%\) chance of falling in \(C(X)\)

Once we have calculated, say, \(C(X) = [0.8, 1.1]\), it is never correct to say “there is a \(95\%\) chance \(g(\theta) \in [0.8, 1.1]\)” on the basis of a confidence interval guarantee. Under a frequentist model this is a category error since \(g(\theta)\) is not random, and even under a Bayesian model where there is an a posteriori probability, that probability depends on our prior.

Duality of tests and confidence regions

Confidence regions are closely related to hypothesis tests, and one can be constructed from the other, as we see below.

Suppose we have a level \(\alpha\) test \(\phi(X; a)\) of \(H_0: g(\theta) = a\) vs \(H_1: g(\theta) \neq a\), for every value \(a\). Then we can use these tests to construct a (non-randomized) confidence region for \(g(\theta)\) as follows:

\[ C(X) = \{a: \phi(X; a) < 1\} \] That is, \(C(X)\) is all non-rejected values of \(a\). \(C(X)\) is a valid confidence region because \[ \mathbb{P}_\theta(C(X) \ni g(\theta)) = \mathbb{P}_\theta(\phi(X; g(\theta)) < 1) \geq 1-\alpha \] Constructing an interval in this way is called inverting a test (or more precisely, a family of tests, one for each \(a\) value).

Conversely, we can obtain a test by inverting a confidence region. Suppose \(C(X)\) is a \(1-\alpha\) confidence region for \(g(\theta)\). Then \[ \phi(x; a) = 1\{a \notin C(x)\} \] is a valid level-\(\alpha\) test of the null \(H_0: g(\theta) = a\) vs. the alternative \(H_1: g(\theta) \neq a\), because \[ \mathbb{E}_\theta \phi(X; g(\theta)) = \mathbb{P}_\theta(g(\theta) \notin C(X)) \leq \alpha \] A confidence region is called unbiased if its probability of including any value other than the true \(g(\theta)\) is at most \(1-\alpha\): \[ \PP_\theta( C(X) \ni a) \leq 1-\alpha, \quad \text{ for all } a \neq g(\theta). \] It is immediate from the definition that \(C(X)\) is unbiased if and only if the test we obtain by inverting it is unbiased. Likewise, the confidence region obtained by inverting an unbiased non-randomized test is also unbiased.
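To illustrate test inversion in code, here is a sketch that inverts a simple equal-tailed binomial test over a grid of candidate values \(a\); the data and grid resolution are made-up, and other choices of test would give (somewhat) different regions.

```python
import numpy as np
from scipy import stats

n, x, alpha = 20, 15, 0.05  # hypothetical binomial data

# C(x) = {a : the level-alpha test of H0: theta = a does not reject}.
# Here we invert an equal-tailed test by scanning a grid of candidate values a.
grid = np.linspace(0.001, 0.999, 999)
p_lower = stats.binom.cdf(x, n, grid)      # P_a(X <= x)
p_upper = stats.binom.sf(x - 1, n, grid)   # P_a(X >= x)
p_two_sided = np.minimum(1.0, 2 * np.minimum(p_lower, p_upper))

accepted = grid[p_two_sided > alpha]
print(accepted.min(), accepted.max())  # endpoints of the (approximate) confidence interval
```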

Example (Multivariate Gaussian confidence ellipse) Suppose that we observe \(X \sim N_d(\mu, \Sigma)\), where the covariance matrix \(\Sigma\) is known and \(\mu\in\RR^d\) is unknown. We can define \[ Z = \Sigma^{-1/2}(X-\mu) \sim N_d(0,I_d). \]

A natural test of the point null \(H_0:\; \mu = \mu_0\) vs \(H_1:\; \mu \neq \mu_0\) is to reject for large values of \[ \|\Sigma^{-1/2} (X-\mu_0)\|^2 \stackrel{H_0}{\sim} \chi_d^2. \] Let \(c_\alpha\) denote the upper \(\alpha\) quantile of the null distribution. The confidence region we obtain by inverting this test is the ellipse: \[ C(X) = \left\{\mu_0:\; \|\Sigma^{-1/2}(X-\mu_0)\|^2 \leq c_\alpha\right\}. \]
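A minimal coverage-check sketch for this ellipse, with a made-up true mean and covariance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d, alpha = 2, 0.05
mu = np.array([1.0, -0.5])                  # hypothetical true mean
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])  # hypothetical known covariance

c_alpha = stats.chi2.ppf(1 - alpha, df=d)   # upper-alpha quantile of chi^2_d
Sigma_inv = np.linalg.inv(Sigma)

def covers(x, mu0):
    """Is mu0 inside C(x) = {mu0 : ||Sigma^{-1/2}(x - mu0)||^2 <= c_alpha}?"""
    diff = x - mu0
    return diff @ Sigma_inv @ diff <= c_alpha

# Monte Carlo coverage: the fraction of ellipses containing mu should be near 1 - alpha.
X = rng.multivariate_normal(mu, Sigma, size=50_000)
print(np.mean([covers(x, mu) for x in X]))
```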

The widget below illustrates the process of sampling these confidence regions.

Confidence intervals and confidence bounds

While confidence regions can come in all shapes (with ellipses and rectangles being common in dimensions greater than one), the most common shape in \(\RR\) is an interval or half-interval. When \(C(X) = [C_L(X), C_U(X)] \subseteq \RR\), we call it a confidence interval (CI); when it is of the form \([C_L(X), \infty)\) or \((-\infty, C_U(X)]\) we call \(C_L(X)\) a lower confidence bound (LCB), and \(C_U(X)\) an upper confidence bound (UCB). We typically obtain confidence intervals by inverting a two-sided test of a point null, and confidence bounds by inverting a one-sided test in the appropriate direction.

A confidence bound is called uniformly most accurate (UMA) if it inverts a (non-randomized) UMP test, and a confidence interval is called uniformly most accurate unbiased (UMAU) if it inverts a (non-randomized) UMPU test.

Example (Exponential): As an example, suppose we observe \(X\sim \text{Exp}(\theta)\) and want to construct confidence bounds or a confidence interval for \(\theta\). The cumulative distribution function is \(\PP_\theta(X\leq x) = 1-e^{-x/\theta}\), for \(x>0\).

To find a lower confidence bound, we invert the UMP test of \(H_0:\;\theta \leq \theta_0\) vs \(H_1:\; \theta > \theta_0\), which rejects for large values of \(X\). The cutoff \(c_\alpha\) solves \[ \alpha = \PP_{\theta_0}(X>c_\alpha) = e^{-c_\alpha/\theta_0} \iff c_\alpha = -\theta_0 \log (\alpha). \] To invert this test, observe that it rejects \(H_0\) if and only if \(X > -\theta_0\log (\alpha)\). Hence, our confidence region is \[ C(X) = \{\theta_0:\; X \leq -\theta_0\log(\alpha)\} = \left[\frac{X}{-\log(\alpha)}, \infty\right), \] giving lower confidence bound \(C_L(X) = \frac{X}{-\log \alpha}\).

By the same token, we can obtain an upper confidence bound by inverting the UMP test for \(H_0:\; \theta \geq \theta_0\) vs \(H_1:\;\theta < \theta_0\). That test rejects when \(X < -\theta_0\log (1-\alpha)\), giving confidence region \[ C(X) = \left\{\theta_0:\; X \geq -\theta_0\log(1-\alpha)\right\} = \left(-\infty, \frac{X}{-\log(1-\alpha)}\right], \] and UCB \(C_U(X)=\frac{X}{-\log(1-\alpha)}\).

To obtain a \(1-\alpha\) confidence interval by inverting the equal-tailed test, we simply intersect the two confidence regions above, both computed at level \(1-\alpha/2\); hence \[ C(X)=\left[\frac{X}{-\log(\alpha/2)}, \frac{X}{-\log(1-\alpha/2)}\right] \]
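A small sketch of these closed-form bounds, with a made-up observation:

```python
import numpy as np

X, alpha = 3.0, 0.05  # hypothetical observed value of X ~ Exp(theta)

lcb = X / (-np.log(alpha))       # lower confidence bound from inverting the UMP test of theta <= theta_0
ucb = X / (-np.log(1 - alpha))   # upper confidence bound from inverting the UMP test of theta >= theta_0
equal_tailed_ci = (X / (-np.log(alpha / 2)), X / (-np.log(1 - alpha / 2)))

print(lcb, ucb, equal_tailed_ci)
```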

Inverting the UMPU test can be more involved in general, but here we can exploit the fact that the exponential is a scale family. If \(c_1\) and \(c_2\) are the left and right cutoffs for the UMPU test of \(H_0:\;\theta=1\) vs \(H_1:\;\theta\neq 1\), then the corresponding cutoffs for testing \(H_0:\;\theta=\theta_0\) are \(\theta_0 c_1\) and \(\theta_0 c_2\). Hence, the confidence interval is \[ C(X) = \left\{ \theta_0:\; \theta_0 c_1 \leq X \leq \theta_0 c_2 \right\} = \left[\frac{X}{c_2}, \frac{X}{c_1}\right]. \]
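Here is a numerical sketch for the UMPU cutoffs, assuming the standard two-sided UMPU conditions for a one-parameter exponential family (the level condition together with the unbiasedness condition \(\EE_1[X\,\phi(X)]=\alpha\,\EE_1[X]\)); the observation \(X\) and the root-finder's starting point are made-up values.

```python
import numpy as np
from scipy.optimize import fsolve

alpha = 0.05

def umpu_conditions(c):
    """Level and unbiasedness conditions for the UMPU test of theta = 1 based on X ~ Exp(1):
        P_1(X < c1) + P_1(X > c2) = alpha,
        E_1[X; X < c1] + E_1[X; X > c2] = alpha * E_1[X] = alpha.
    """
    c1, c2 = c
    level = (1 - np.exp(-c1)) + np.exp(-c2) - alpha
    unbiased = (1 - (c1 + 1) * np.exp(-c1)) + (c2 + 1) * np.exp(-c2) - alpha
    return [level, unbiased]

c1, c2 = fsolve(umpu_conditions, x0=[0.04, 5.0])

# Invert the test using the scale-family argument above: C(X) = [X / c2, X / c1].
X = 3.0  # hypothetical observation
print(c1, c2, X / c2, X / c1)
```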

(Mis-)Interpreting Hypothesis Tests

Hypothesis tests, and their manifestations as \(p\)-values and confidence intervals, are ubiquitous in science and social science because drawing reliable conclusions from data is a ubiquitous goal. They are tremendously useful tools, but they unfortunately lend themselves to a variety of misinterpretations. For example, it is distressingly common to observe the following reasoning errors implicit in published scientific work:

  1. \(p < 0.05\), therefore there is an effect (which is equal to the point estimate).

  2. \(p > 0.05\), therefore there is no effect

  3. \(p < 10^{-6}\), therefore the effect is huge

  4. \(p < 10^{-6}\), therefore “the data are highly significant” and all point estimates of model parameters can be taken at face value

  5. The CI for the effect size on men is \([0.2, 3.2]\), but for women it is \([-0.2, 2.8]\), therefore there is an effect for men but not for women.

As a broad generalization, confidence intervals tend not to mislead novices as frequently as \(p\)-values or dichotomous accept/reject decisions do. For example, “the effect size was \(1.4\) (\(p = 0.03\))” gives an impression of greater precision than “the effect size was \(1.4\) (CI \([0.14, 2.7]\)).”

More subtly, statistical models are almost always abstractions that fail to capture every detail of a real-world scientific setting. It is always important to keep these limitations in mind, but especially so when we are looking to draw some conclusive inference from our data. Many errors arise from the desire of scientists, who may feel shaky in their understanding of statistics, to compartmentalize the statistical analysis from the scientific reasoning it supports.

Unfortunately, interpreting tests can never be made easy or automatic, for the same reason that science can never be made easy or automatic. Hypothesis tests are a tool for critical thinking, not a substitute for it: they let us ask specific questions, in specific ways, under specific modeling assumptions, and all of the choices we make along the way must be justified as a part of any argument that we ultimately want to make for a scientific conclusion. When scientists report some scientific claim supported by a \(p\)-value, but omit any description of what analysis was performed or what assumptions were made in order to produce the \(p\)-value, their arguments are inherently incomplete.

Conceptual objections to hypothesis testing

It is an uncontroversial view among statisticians that hypothesis tests are often used carelessly, leading to unsound conclusions. More controversially, some statisticians are skeptical that hypothesis tests can ever be a conceptually sound tool for statistical analysis even when they are interpreted correctly. The following objections are commonly heard:

Objection 1: Point nulls are unrealistic

Some critics ask why we should ever bother to test \(\theta = 0\). In practice, they say, no effect size is ever equal to zero, so we learn nothing by using the data to establish that \(\theta \neq 0\).

There are at least two good answers to this objection. The most direct answer, at least in parametric problems, is that if we don’t want to test the point null we have plenty of other options, for example:

  1. We can always test a different null, such as \(H_0:\;|\theta|\leq \delta\), for some value \(\delta>0\).

  2. We can interpret the two-sided test of \(H_0:\;\theta=0\) as two one-sided tests of \(H_0^1:\;\theta\geq 0\) and \(H_0^2:\;\theta\leq 0\), with Type I error levels \(\alpha_1\) and \(\alpha_2\), respectively. That is, if \(T(X)>c_2\) we can reject \(H_0^2\) and conclude \(\theta>0\), and if \(T(X)<c_1\) we can reject \(H_0^1\) and conclude \(\theta<0\). Because \(\alpha_1+\alpha_2=\alpha\), the probability that either test makes a Type I error (and thus the probability that we make a false claim about the sign of \(\theta\)) is at most \(\alpha\). On this interpretation, we never reject the point null \(H_0\) without learning the sign of \(\theta\).

  3. We can usually invert a test of the point null to obtain a CI for \(\theta\).

The objection to exact nulls can be harder to answer in non-parametric problems, such as in nonparametric two-sample tests that formally test the hypothesis that two distributions \(P\) and \(Q\) are identical to each other. Depending on what test we use, we may be able to justify drawing a more specific conclusion from a rejection (e.g., that the median of \(P\) is greater or less than the median of \(Q\)). On the other hand, there may not be other good options for nonparametric analysis of the data; in particular, using Bayesian nonparametric models requires making much stronger assumptions.

Objection 2: Frequentist methods answer the wrong question

A second objection is that frequentist methods like hypothesis tests are at best a clever evasion of the questions scientists really want to answer, and at worst a deceptive bait and switch. For example, instead of telling scientists what they really want to know (the probability that \(H_0\) is true in light of the data), frequentists calculate a \(p\)-value, which is the probability that the data would be at least as extreme as observed given that the null is true. They implicitly hope scientists don’t dwell too much on the difference between these two probabilities, and then blame the scientists when they can’t keep it straight!

Many practitioners are chronically confused about what confidence intervals mean and don’t mean: any good lesson about CIs (including this one) must include copious warnings about what we can’t say about them. Again, this is because what scientists really want to be able to do is calculate a confidence interval and say that the estimand is probably in the interval; but we can only make such a probability statement about a CI before we have calculated it (or, having calculated it, we can only make probability statements about the CI in a new experiment).

These objections cannot be dismissed as easily as the first objection; these really are drawbacks of frequentist methods. But Bayesian alternatives have drawbacks of their own, especially that they require the analyst to supply his or her own opinions about every aspect of the problem, including:

  1. The probability that the null is true (which is the very question that we are trying to answer), and

  2. The probability distribution over alternative values (which the frequentist very often does not need to worry about, because they can use a UMP or other canonical choice of test).

One major advantage of hypothesis tests and other frequentist methods that has led to their abiding popularity in scientific data analysis is their versatility and applicability in settings where we wish to be parsimonious with our assumptions.