Scheffé's Method for Multiple Comparisons
Introduction
I’ve written previously about using the Bonferroni correction for the multiple comparisons problem. To correct for $k$ comparisons, we simply divide our desired Type-I error rate threshold, $\alpha$, by $k$. Thus, for each comparison, we call it statistically significant if the p-value is less than $\alpha / k$. To calculate simultaneous $100 (1 - \alpha)\%$ confidence intervals, we calculate confidence intervals that individually have coverage $100 (1 - \alpha/k) \%$. This procedure could hardly be any simpler, but it has a downside: it can be overly conservative, and dramatically reduces our power. In trying to avoid Type-I errors (erroneously rejecting the null hypothesis and claiming an effect when the data do not support such a conclusion), we reduce our ability to identify real effects. That’s throwing the baby out with the bath water!
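As a minimal sketch of the procedure (with made-up p-values), the Bonferroni correction is just a change of threshold:

```python
import numpy as np

alpha = 0.05
p_values = np.array([0.001, 0.012, 0.030, 0.047])  # hypothetical p-values
k = len(p_values)

# Bonferroni: compare each p-value against alpha / k instead of alpha.
significant = p_values < alpha / k
print(significant)  # [ True  True False False]
```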
While the Bonferroni correction is without a doubt the simplest way to correct for multiple comparisons, it is not the only way. I often hear people advocate for controlling the False Discovery Rate, for example by using the Benjamini-Hochberg procedure, but there is another, much less well-known procedure that I think warrants consideration. Scheffé’s method (named after Henry Scheffé) allows us to construct simultaneous confidence intervals on arbitrarily many functions of the model parameters.
Scheffé’s Method for Linear Models
Scheffé’s method starts with a confidence region on the model parameters. Let me illustrate with a simple example from linear regression. Suppose $y = X \beta + \epsilon$, with $\beta \in \mathbb{R}^p$, and $\epsilon^{(i)} \sim N(0, \sigma^2)$. I’ll assume that all of the assumptions of linear regression are satisfied: $x^{(i)} \perp \! \! \! \perp \epsilon^{(i)}$; observations are IID; the design matrix $X$ is full rank. I’ll additionally assume that $\sigma^2$ is known, just for simplicity. Then the estimated model parameters $\hat{\beta} = (X^T X)^{-1} X^T y$ have the distribution $\hat{\beta} \sim N (\beta, \mathcal{I}^{-1})$, where $\mathcal{I} = (X^T X) / \sigma^2$ is the Fisher information.
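Here is a quick sketch of this setup with simulated data (the numbers are arbitrary); it computes $\hat{\beta}$ and the Fisher information that the rest of the post builds on:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 1.5

# Simulated design (intercept plus two covariates) and arbitrary true coefficients.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# OLS estimate and Fisher information (treating sigma^2 as known).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
fisher_info = X.T @ X / sigma**2
```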
Since $\mathcal{I}^{1/2} (\hat{\beta} - \beta) \sim N(0, I)$, and since $z^T z \sim \chi_p^2$ when $z \sim N(0, I)$ and $z \in \mathbb{R}^p$, the set $C_\alpha := \{ \beta^\prime : (\hat{\beta} - \beta^\prime)^T \mathcal{I} (\hat{\beta} - \beta^\prime) \leq \chi_{\alpha; p}^2 \}$ is a confidence region for $\beta$, where $\chi_{\alpha; p}^2$ is the upper $\alpha$ critical value (the $1 - \alpha$ quantile) of a $\chi^2$ distribution with $p$ degrees of freedom. The confidence region has the property that $P\{ \beta \in C_\alpha \} \geq 1 - \alpha$. (In this case the coverage is exact; in other cases an exact confidence region may be unavailable, and we must settle for a conservative one.)
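Continuing the sketch above (reusing `p`, `beta_true`, `beta_hat`, and `fisher_info`), checking whether a candidate parameter vector lies in $C_\alpha$ is a single quadratic-form comparison:

```python
from scipy.stats import chi2

alpha = 0.05
crit = chi2.ppf(1 - alpha, df=p)  # chi^2_{alpha; p}, the 1 - alpha quantile

def in_region(beta_prime, beta_hat, fisher_info, crit):
    """True if beta_prime lies in the confidence region C_alpha."""
    d = beta_hat - beta_prime
    return d @ fisher_info @ d <= crit

# The true beta lands inside the region in roughly 1 - alpha of repeated experiments.
print(in_region(beta_true, beta_hat, fisher_info, crit))
```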
Suppose we wish to compute a confidence interval on some function of the model parameters, $f(\beta)$. Define $\ell = \inf_{\beta^\prime \in C_\alpha} f(\beta^\prime)$, and $u = \sup_{\beta^\prime \in C_\alpha} f(\beta^\prime)$. Then $(\ell, u)$ is a $100 ( 1 - \alpha )\%$ confidence interval on $f(\beta)$. Proof:
\begin{align*} P \{ \ell \leq f(\beta) \leq u \} &= P \{ \ell \leq f(\beta) \leq u \; | \; \beta \in C_\alpha \} \cdot P \{ \beta \in C_\alpha \} \\ &\phantom{=} + P \{ \ell \leq f(\beta) \leq u \; | \; \beta \notin C_\alpha \} \cdot P \{ \beta \notin C_\alpha \} \\ &\geq P \{ \ell \leq f(\beta) \leq u \; | \; \beta \in C_\alpha \} \cdot P \{ \beta \in C_\alpha \} \\ &\geq (1 - \alpha) \cdot P \{ \ell \leq f(\beta) \leq u \; | \; \beta \in C_\alpha \} \\ &= 1 - \alpha. \end{align*} The first line is the law of total probability; the second drops a nonnegative term; the third uses the coverage of $C_\alpha$; and the last follows from how we’ve defined $\ell$ and $u$. The key insight is: given that the model parameters are in their confidence region, $f(\beta)$ is guaranteed to be in its confidence interval, because we have defined the limits to be the smallest and largest values possible under that constraint.
Moreover, $(\ell, u)$ is a simultaneous $100 ( 1 - \alpha )\%$ confidence interval on $f(\beta)$. To see this, redefine $\ell \to \ell_f$, $u \to u_f$, and consider a similar confidence interval on $g(\beta)$: $(\ell_g, u_g)$, where $\ell_g := \inf_{\beta^\prime \in C_\alpha} g(\beta^\prime)$ and $u_g := \sup_{\beta^\prime \in C_\alpha} g(\beta^\prime)$. Then \begin{align*} &P \{ \ell_f \leq f(\beta) \leq u_f \textrm{ and } \ell_g \leq g(\beta) \leq u_g \} \\ &\quad \geq P \{ \ell_f \leq f(\beta) \leq u_f \textrm{ and } \ell_g \leq g(\beta) \leq u_g \; | \; \beta \in C_\alpha \} \cdot P \{ \beta \in C_\alpha \} \\ &\quad \geq 1 - \alpha. \end{align*} Again: given that $\beta \in C_\alpha$, $f(\beta)$ and $g(\beta)$ are guaranteed to be in their respective intervals, since that’s how we’ve defined the limits. The coverage of the confidence intervals is inherited from the confidence region.
Calculating the Interval
Scheffé’s method is only practical when (a) we have a confidence region on the model parameters, and (b) we can compute the infimum and supremum of the desired function over the confidence region. For (b), when the confidence region is convex and $f$ is linear, computing the infimum and the supremum amounts to solving a pair of convex optimization problems. For example, to compute $\ell$, we solve: $$ \begin{array}{cl} \mbox{minimize} & c^T \beta \\ \mbox{subject to} & (\hat{\beta} - \beta)^T \mathcal{I} (\hat{\beta} - \beta) \leq \chi_{\alpha; p}^2 \end{array} $$ where the variable is $\beta$ and the other terms are data. We can actually solve this problem analytically. Let $z = \mathcal{I}^{1/2} (\hat{\beta} - \beta) / \sqrt{\chi_{\alpha; p}^2}$, so that $\beta = \hat{\beta} - \sqrt{\chi_{\alpha; p}^2} \, \mathcal{I}^{-1/2} z$. Then the problem above is equivalent to $$ \begin{array}{cl} \mbox{minimize} & c^T \hat{\beta} - \sqrt{\chi_{\alpha; p}^2} \, c^T \mathcal{I}^{-1/2} z \\ \mbox{subject to} & z^T z \leq 1 \end{array} $$ The objective is minimized when $z$ is “in the same direction” as $\mathcal{I}^{-1/2} c$, so $z^\star = \mathcal{I}^{-1/2} c / \| \mathcal{I}^{-1/2} c \|_2$, and \begin{align*} \ell &= c^T \hat{\beta} - \sqrt{\chi_{\alpha; p}^2} \frac{c^T \mathcal{I}^{-1} c}{\| \mathcal{I}^{-1/2} c \|_2} \\ &= c^T \hat{\beta} - \sqrt{\chi_{\alpha; p}^2} \, \| \mathcal{I}^{-1/2} c \|_2 \end{align*}
A similar line of reasoning reveals that $$ u = c^T \hat{\beta} + \sqrt{\chi_{\alpha; p}^2} \, \| \mathcal{I}^{-1/2} c \|_2. $$ In summary, $c^T \hat{\beta} \pm \sqrt{\chi_{\alpha; p}^2} \, \| \mathcal{I}^{-1/2} c \|_2$ are the endpoints of simultaneous $100 (1 - \alpha) \%$ confidence intervals on $c^T \beta$, for arbitrarily many $c$.
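Continuing the simulated example from earlier (so `beta_hat` and `fisher_info` are the quantities computed there), a minimal sketch of these simultaneous intervals:

```python
import numpy as np
from scipy.stats import chi2

def scheffe_interval(c, beta_hat, fisher_info, alpha=0.05):
    """Simultaneous Scheffe interval for c^T beta, sigma^2 known."""
    p = len(beta_hat)
    crit = np.sqrt(chi2.ppf(1 - alpha, df=p))
    # ||I^{-1/2} c||_2 = sqrt(c^T I^{-1} c); use a solve instead of an explicit inverse.
    half_width = crit * np.sqrt(c @ np.linalg.solve(fisher_info, c))
    center = c @ beta_hat
    return center - half_width, center + half_width

# Arbitrarily many contrasts c share the same 1 - alpha simultaneous coverage.
for c in [np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.0, -1.0])]:
    print(scheffe_interval(c, beta_hat, fisher_info))
```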
Compare this with the Bonferroni-based interval, whose endpoints are $c^T \hat{\beta} \pm z_{\alpha / (2k)} \| \mathcal{I}^{-1/2} c \|_2$, where $z_{\alpha/(2k)}$ is the upper $\alpha/(2k)$ point of the standard normal distribution, so that each of the $k$ intervals individually has coverage $1 - \alpha/k$. Let $k^\ast$ be the smallest integer such that $z_{\alpha / (2 k^\ast)} \geq \sqrt{\chi_{\alpha; p}^2}$. If we are making fewer than $k^\ast$ comparisons, we’re better off using the Bonferroni correction; otherwise we’re better off using Scheffé’s method.
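To see where the crossover happens for a given $p$ and $\alpha$, one can simply compare critical values; a small sketch (the $2k$ in the denominator reflects the two-sided intervals):

```python
import numpy as np
from scipy.stats import chi2, norm

alpha, p = 0.05, 3
scheffe_crit = np.sqrt(chi2.ppf(1 - alpha, df=p))

# Smallest k at which the Bonferroni critical value reaches the Scheffe one.
k_star = 1
while norm.ppf(1 - alpha / (2 * k_star)) < scheffe_crit:
    k_star += 1
print(k_star)  # 10 for p = 3, alpha = 0.05
```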
Generalizations
Scheffé’s method was originally devised for the situation where $\sigma^2$ is unknown and must therefore be estimated. In this case, the main thing that changes is the confidence region on $\beta$: $$ C_\alpha = \left\{ \beta^\prime : \frac{1}{p}(\hat{\beta} - \beta^\prime)^T \hat{\mathcal{I}} (\hat{\beta} - \beta^\prime) \leq F_{\alpha; n-p}^p \right\}, $$ where $F_{\alpha; n-p}^p$ is the upper $\alpha$ critical value of Snedecor’s $F$ distribution with $p$ and $n-p$ degrees of freedom in the numerator and denominator, respectively; and $\hat{\mathcal{I}} = X^T X / \hat{\sigma}^2$ with $\hat{\sigma}^2 := \| y - X \hat{\beta} \|^2 / (n - p)$. I’ll leave it as an exercise for the reader to show that $c^T \hat{\beta} \pm \sqrt{p \cdot F_{\alpha; n-p}^p} \, \| \hat{\mathcal{I}}^{-1/2} c \|_2$ are the endpoints of simultaneous $100 (1 - \alpha) \%$ confidence intervals on $c^T \beta$.
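As a sketch of this unknown-$\sigma^2$ version (with illustrative variable names, using scipy for the $F$ quantile):

```python
import numpy as np
from scipy.stats import f

def scheffe_interval_unknown_sigma(c, X, y, alpha=0.05):
    """Simultaneous Scheffe interval for c^T beta when sigma^2 is estimated."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    fisher_hat = X.T @ X / sigma2_hat
    crit = np.sqrt(p * f.ppf(1 - alpha, p, n - p))
    half_width = crit * np.sqrt(c @ np.linalg.solve(fisher_hat, c))
    center = c @ beta_hat
    return center - half_width, center + half_width
```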
Although the method is mathematically valid for any function $f(\beta)$, we said it is only practical when we can actually calculate $\ell$ and $u$ by solving the corresponding optimization problems. Linear functions are not the only family of functions for which this is the case. If $f(\beta) = \frac{c^T \beta + d}{a^T \beta + b}$, then $f$ is quasilinear (wherever the denominator is positive) and we can still compute the infimum and supremum efficiently. I often work with relative differences in parameters, like $(\beta_2 - \beta_1) / \beta_1$, so this is a useful extension. There is no analytical form for the resulting interval, however; we must use numerical techniques such as bisection, as described in Boyd and Vandenberghe’s Convex Optimization book and sketched below.
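Here is one way that numerical step might look: a bisection on the quasiconvex feasibility test, reusing the closed-form minimum of a linear function over the ellipsoid. This is a sketch under the assumption that $a^T \beta + b > 0$ everywhere on $C_\alpha$; `inf_linear_fractional` and its arguments are illustrative names, not a library API.

```python
import numpy as np
from scipy.stats import chi2

def inf_linear_fractional(c, d, a, b, beta_hat, fisher_info, alpha=0.05, tol=1e-8):
    """Infimum of (c^T beta + d) / (a^T beta + b) over the ellipsoid C_alpha,
    assuming the denominator is positive on all of C_alpha."""
    crit = np.sqrt(chi2.ppf(1 - alpha, df=len(beta_hat)))

    def min_linear(w, w0):
        # Closed-form minimum of w^T beta + w0 over C_alpha (same formula as above).
        return w @ beta_hat + w0 - crit * np.sqrt(w @ np.linalg.solve(fisher_info, w))

    def feasible(t):
        # Some beta in C_alpha has f(beta) <= t  <=>  min of (c - t a)^T beta + (d - t b) <= 0.
        return min_linear(c - t * a, d - t * b) <= 0

    hi = (c @ beta_hat + d) / (a @ beta_hat + b)  # f(beta_hat) upper-bounds the infimum
    lo, step = hi - 1.0, 1.0
    while feasible(lo):                           # expand downward until we bracket the infimum
        lo -= step
        step *= 2
    while hi - lo > tol:                          # bisection: lo infeasible, hi feasible
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi
```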
Even more generally, methods from duality can be used to find “lower bounds on the lower bound”. That’s because the value of the dual problem is always less than or equal to the value of the primal problem, and we can often solve the dual problem efficiently even when the primal problem is nonconvex.
We are not limited to linear regression models, either. In a generalized linear model, $C_\alpha := \{ \beta^\prime : 2 \ell \ell(\hat{\beta}) - 2 \ell \ell(\beta^\prime) \leq \chi_{\alpha; p}^2 \}$ is an asymptotic confidence region on $\beta$, where $\ell \ell$ is the log likelihood. Similar to the distinction between $\sigma^2$ being known or unknown, generalized linear models typically have a dispersion parameter that may be unknown. In that case we cannot evaluate $\ell \ell$ itself, but we can still calculate the deviance $D$, and form a confidence region based on the $F$ distribution: $$ C_\alpha := \left\{ \beta^\prime : \frac{(D(\beta^\prime) - D(\hat{\beta})) / p}{D(\hat{\beta}) / (n - p)} \leq F_{\alpha; n-p}^p \right\}. $$
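For concreteness, a rough sketch for a logistic regression (no unknown dispersion, so the $\chi^2$ region applies), using scipy’s general-purpose optimizer rather than a dedicated GLM routine; the data and variable names are purely illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-0.5, 1.0, 0.8])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def loglik(beta):
    eta = X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))  # Bernoulli log likelihood

beta_hat = minimize(lambda b: -loglik(b), np.zeros(p)).x  # maximum likelihood fit
crit = chi2.ppf(1 - 0.05, df=p)

def scheffe_glm_interval(c):
    """Scheffe-style interval for c^T beta over the likelihood-ratio region."""
    cons = {"type": "ineq",
            "fun": lambda b: crit - 2.0 * (loglik(beta_hat) - loglik(b))}
    lo = minimize(lambda b: c @ b, beta_hat, constraints=[cons], method="SLSQP").fun
    hi = -minimize(lambda b: -(c @ b), beta_hat, constraints=[cons], method="SLSQP").fun
    return lo, hi

print(scheffe_glm_interval(np.array([0.0, 1.0, -1.0])))
```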
Scheffé’s method can also be used to assess statistical significance. For a null hypothesis $f(\beta) = d$: if $d \notin (\ell, u)$, reject. As with the Bonferroni correction, I don’t think it makes sense to change the p-values themselves; just which p-values get labeled as statistically significant.
Comparison with Other Methods
In short, Scheffé’s method is powerful and broadly applicable. So why not use it all the time? Let’s go back to the simple linear regression case. Scheffé’s method is only better than the Bonferroni correction if $z_{\alpha / (2k)} \geq \sqrt{\chi_{\alpha; p}^2}$. When the number of model parameters $p$ is large, so is $\chi_{\alpha; p}^2$. When we have a model with lots of parameters and we only care about a few comparisons, Bonferroni will be better (and indeed, it’s easy to check which one is better in any given circumstance; I leave that to the reader).
When we have a model with lots of parameters and we are interested in a lot of comparisons, neither method will work well. Then and only then do I think it makes sense to turn to the False Discovery Rate and the Benjamini-Hochberg procedure. Why? Because when we control the False Discovery Rate, we’re moving the goal posts, and I don’t think that’s something to do lightly. I think this procedure makes the most sense in a multi-phase testing approach, where we start with thousands of hypotheses and use FDR to “whittle down” the list to a small collection of hypotheses corresponding to likely real effects. Then we do another experiment focusing just on those hypotheses, using the Bonferroni correction (since we’re assuming the number of parameters is too large for Scheffé’s method to be effective).
Where to Read More
Seber and Lee is my go-to reference for Linear Regression, and discusses Scheffé’s method as well as some other options for simultaneous intervals. McCullagh and Nelder’s Generalized Linear Models is my all-time favorite statistics book, and discusses the $\chi^2$-based confidence region for generalized linear models, while Simon Wood’s Generalized Additive Models discusses using the $F$ distribution when the dispersion parameter is unknown. The last two references both provide important caveats on when these regions may be considered valid. Boyd and Vandenberghe is my go-to for Convex Optimization.