Robust Power Assessment
An important part of planning any statistical experiment is power analysis. In this post I will focus on power analysis for linear regression models, but I am hopeful much of this can be applied to Generalized Linear Models and hence to the sorts of A/B tests I normally run.
Preliminaries
Suppose $y = X \beta + \epsilon$ where $y \in \mathbb{R}^n$, $\beta \in \mathbb{R}^p$, and all of the assumptions of linear regression are met, as described in our last post on Scheffé’s method. Suppose we are interested in testing a null hypothesis of the form $C \beta = d$, where $C \in \mathbb{R}^{q \times p}$ has full row rank and $d \in \mathbb{R}^q$. Let $$ T_C(d) := \frac{1}{q} \left( C \hat{\beta} - d \right)^T \left( C \hat{\mathcal{I}}^{-1} C^T \right)^{-1} \left( C \hat{\beta} - d \right) = \frac{1}{q} \| C \hat{\beta} - d \|_{C \hat{\mathcal{I}}^{-1} C^T}^2, $$ where $\| v \|_A^2 := v^T A^{-1} v$ denotes the quadratic form with the inverse of $A$, $\hat{\beta} = (X^T X)^{-1} X^T y$ is the least squares estimate of $\beta$, and $\hat{\mathcal{I}} = X^T X / \hat{\sigma}^2$ is the estimated Fisher information, with $\hat{\sigma}^2 = \| y - X \hat{\beta} \|_2^2 / (n - p)$ the estimated variance of $\epsilon$.
As derived in [Wood17, §1.3.4], under the null hypothesis, $T_C(d) \sim F_{n-p}^q$, where $F_{n-p}^q$ is Snedecor’s $F$ distribution with $q$ and $n-p$ degrees of freedom in the numerator and denominator, respectively. Thus, $1 - \Phi_{F_{n-p}^q}(T_C(d))$ is a p-value against the null hypothesis $C \beta = d$, where $\Phi_{F_{n-p}^q}(\cdot)$ is the cumulative distribution function for $F_{n-p}^q$.
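To make this concrete, here is a minimal sketch of the test in Python with NumPy and SciPy. The function name and arguments are my own placeholders, and I assume `X` has full column rank and `C` has full row rank so the relevant inverses exist.

```python
import numpy as np
from scipy import stats

def f_test(y, X, C, d):
    """F statistic T_C(d) and p-value for the null hypothesis C beta = d."""
    n, p = X.shape
    q = C.shape[0]
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]         # least squares estimate
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)  # estimated noise variance
    # C I_hat^{-1} C^T = sigma2_hat * C (X^T X)^{-1} C^T
    cov = sigma2_hat * (C @ np.linalg.inv(X.T @ X) @ C.T)
    diff = C @ beta_hat - d
    T = diff @ np.linalg.solve(cov, diff) / q               # T_C(d)
    p_value = stats.f.sf(T, q, n - p)                       # 1 - CDF of F_{n-p}^q
    return T, p_value
```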
Under an alternative hypothesis, such as $C \beta = d^\prime$, $T_C(d)$ has a non-central $F$ distribution with non-centrality parameter $\lambda = \| d^\prime - d \|_{C \mathcal{I}^{-1} C^T}^2$, where $\mathcal{I} = X^T X / \sigma^2$ is the true Fisher information (that is, using the true variance $\sigma^2$ rather than the estimated $\hat{\sigma}^2$). For a derivation, see my notes on Linear Regression. Thus the power of the test against the alternative when rejecting at level $\alpha$ is $$ 1 - \Phi_{F_{n-p; \lambda}^q}\left( F_{n-p}^{q; 1 - \alpha} \right) $$ where $F_{n-p}^{q; 1 - \alpha}$ is the $1-\alpha$ quantile of a (central) $F_{n-p}^q$ distribution and $F_{n-p; \lambda}^q$ denotes the non-central $F_{n-p}^q$ distribution with non-centrality parameter $\lambda$.
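Here is a companion sketch for the power formula, again with hypothetical names: `beta_alt` encodes the alternative through $d^\prime = C \beta_{\textrm{alt}}$, and `sigma2` is the assumed true noise variance.

```python
import numpy as np
from scipy import stats

def power(X, C, d, beta_alt, sigma2, alpha=0.05):
    """Power of the level-alpha F-test of C beta = d when in fact C beta = C beta_alt."""
    n, p = X.shape
    q = C.shape[0]
    # Non-centrality parameter: lambda = || C beta_alt - d ||^2 in the C I^{-1} C^T norm
    diff = C @ beta_alt - d
    cov = sigma2 * (C @ np.linalg.inv(X.T @ X) @ C.T)  # C I^{-1} C^T
    lam = diff @ np.linalg.solve(cov, diff)
    crit = stats.f.ppf(1 - alpha, q, n - p)            # 1 - alpha quantile of the central F
    return stats.ncf.sf(crit, q, n - p, lam)           # P{ non-central F > critical value }
```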
Robust Power Assessment
The most obvious problem we run into when assessing power is: what alternative hypothesis should we choose? A slightly less obvious problem is: what should we assume for $\sigma^2$?
The first option is to use point estimates of $C \beta$ and $\sigma^2$, perhaps from previous experiments. A more robust approach, and the focus of this post, is to search a neighborhood of the point estimates for the values that result in the worst-case power. We can then be confident that small errors in the point estimates will not derail our experiment.
The problem we solve is: $$ \begin{array}{cl} \mbox{minimize} & 1 - \Phi_{F_{n-p; \lambda}^q}\left( F_{n-p}^{q; 1 - \alpha} \right) \\\ \mbox{subject to} & \beta \in S_\beta \\\ & \sigma^2 \in S_{\sigma^2} \end{array} $$ where $S_\beta$ and $S_{\sigma^2}$ are confidence regions on $\beta$ and $\sigma^2$, respectively (to be discussed below). The variables in this problem are $\beta$ and $\sigma^2$, which enter the objective function via the non-centrality parameter $\lambda$.
First note that the objective is an increasing function of $\lambda$. This follows from the nature of the non-central $F$ distribution: the larger the non-centrality parameter, the easier it is to reject the null hypothesis and the greater the power will be. So an equivalent problem is: $$ \begin{array}{cl} \mbox{minimize} & \lambda = \| C \beta - d \|_{C \mathcal{I}^{-1} C^T}^2 \\\ \mbox{subject to} & \beta \in S_\beta \\\ & \sigma^2 \in S_{\sigma^2}, \end{array} $$ which, since $\mathcal{I}^{-1} = \sigma^2 (X^T X)^{-1}$, can be more conveniently expressed as $$ \begin{array}{cl} \mbox{minimize} & \frac{( C \beta - d )^T \left( C (X^T X)^{-1} C^T \right)^{-1} ( C \beta - d )}{\sigma^2} \\\ \mbox{subject to} & \beta \in S_\beta \\\ & \sigma^2 \in S_{\sigma^2}. \end{array} $$ The objective is a convex (quadratic-over-linear) function of $(\beta, \sigma^2)$, so if $S_\beta$ and $S_{\sigma^2}$ are convex regions, we have a convex optimization problem that can be efficiently solved.
If we have data from a previous experiment, then $$ S_\beta(\alpha) = \left\{ \beta^\prime : (\hat{\beta}_{\textrm{prev}} - \beta^\prime)^T X_{\textrm{prev}}^T X_{\textrm{prev}} (\hat{\beta}_{\textrm{prev}} - \beta^\prime) \leq \sigma^2 \cdot \chi_{p}^{2; 1 - \alpha} \right\}, $$ where $\chi_{p}^{2; 1 - \alpha}$ is the $1 - \alpha$ quantile of a $\chi_{p}^2$ distribution, is such a region for $\beta$. (It may be possible to get a tighter region, especially when $q \ll p$, but I leave that to the reader to figure out.) Similarly, $$ S_{\sigma^2}(\alpha) = \left\{ \sigma^{\prime 2} : 0 \leq \sigma^{\prime 2} \leq \frac{(n_{\textrm{prev}}-p) \hat{\sigma}_{\textrm{prev}}^2 }{ \chi_{n_{\textrm{prev}}-p}^{2; \alpha} } \right\}, $$ where $\chi_{n_{\textrm{prev}}-p}^{2; \alpha}$ is the lower $\alpha$ quantile of a $\chi_{n_{\textrm{prev}}-p}^2$ distribution, chosen so that each region contains the true parameter with probability $1 - \alpha$.
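For concreteness, here is a small sketch of the scalars defining these two regions, computed from summary statistics of the previous experiment; the names are placeholders.

```python
from scipy import stats

def region_bounds(n_prev, p, sigma2_hat_prev, alpha=0.05):
    """Scalars defining S_beta(alpha) and S_sigma2(alpha)."""
    # S_beta: || X_prev (beta_hat_prev - beta) ||_2^2 <= sigma^2 * chi2_beta
    chi2_beta = stats.chi2.ppf(1 - alpha, p)
    # S_sigma2: 0 <= sigma^2 <= sigma2_max, using the lower alpha quantile of
    # chi^2_{n_prev - p} so that the bound holds with probability 1 - alpha
    sigma2_max = (n_prev - p) * sigma2_hat_prev / stats.chi2.ppf(alpha, n_prev - p)
    return chi2_beta, sigma2_max
```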
With these regions, the problem becomes $$ \begin{array}{cl} \mbox{minimize} & \frac{( C \beta - d )^T \left( C (X^T X)^{-1} C^T \right)^{-1} ( C \beta - d )}{\sigma^2} \\\ \mbox{subject to} & \| X_{\textrm{prev}} (\hat{\beta}_{\textrm{prev}} - \beta) \|_2^2 \leq \sigma^2 \cdot \chi_{p}^{2; 1 - \alpha} \\\ & 0 \leq \sigma^2 \leq \frac{(n_{\textrm{prev}}-p) \hat{\sigma}_{\textrm{prev}}^2 }{ \chi_{n_{\textrm{prev}}-p}^{2; \alpha} }, \end{array} $$ with variables $\beta$ and $\sigma^2$; the other parameters are data. The value of this problem, $\lambda^\star$, is the worst-case non-centrality parameter, which in turn can be plugged into the power formula to calculate the worst-case power.
In this case, it is plain that the optimal $\sigma^2$ equals its upper bound: increasing $\sigma^2$ both shrinks the objective and relaxes the constraint on $\beta$. With $\sigma^2$ fixed at that bound, the problem reduces to minimizing a convex quadratic in $\beta$ subject to a single convex quadratic inequality constraint, i.e., a quadratically constrained quadratic program.
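Below is a minimal sketch of the whole worst-case calculation with CVXPY, fixing $\sigma^2$ at its upper bound as just discussed. All names are placeholders; I assume the previous experiment used the same $p$ model columns, and I reuse a single $\alpha$ for the test level and both confidence regions, as in the formulas above.

```python
import numpy as np
import cvxpy as cp
from scipy import stats

def worst_case_power(X, C, d, X_prev, beta_hat_prev, sigma2_hat_prev, alpha=0.05):
    """Worst-case non-centrality parameter and power over S_beta(alpha) and S_sigma2(alpha)."""
    n, p = X.shape
    q = C.shape[0]
    n_prev = X_prev.shape[0]

    # Worst-case sigma^2 is its upper confidence bound (lower alpha quantile of chi^2).
    sigma2_max = (n_prev - p) * sigma2_hat_prev / stats.chi2.ppf(alpha, n_prev - p)

    # lambda = (C beta - d)^T (C (X^T X)^{-1} C^T)^{-1} (C beta - d) / sigma^2
    M = np.linalg.inv(C @ np.linalg.inv(X.T @ X) @ C.T)
    M = 0.5 * (M + M.T)  # symmetrize for quad_form

    beta = cp.Variable(p)
    lam = cp.quad_form(C @ beta - d, M) / sigma2_max
    region = (cp.sum_squares(X_prev @ (beta_hat_prev - beta))
              <= sigma2_max * stats.chi2.ppf(1 - alpha, p))
    lam_star = cp.Problem(cp.Minimize(lam), [region]).solve()

    # Plug the worst-case non-centrality parameter into the power formula.
    crit = stats.f.ppf(1 - alpha, q, n - p)
    return lam_star, stats.ncf.sf(crit, q, n - p, lam_star)
```

CVXPY reduces the quadratic objective and constraint to second-order cone form, so the default solver handles this problem without any special configuration.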
Probabilistic Interpretation
When we use these regions for $\beta$ and $\sigma^2$, our robust approach has a probabilistic interpretation. Let $\lambda^\star$ be the worst-case non-centrality parameter. Then $P\{ \lambda \geq \lambda^\star \} \geq (1 - \alpha)^2 \geq 1 - 2 \alpha$, by a derivation similar to the one we used with Scheffé’s method. This gives us a way to guarantee a certain minimum power with any probability we wish: for example, taking $\alpha = 0.025$ guarantees that the true power is at least the worst-case power with probability at least $0.975^2 \approx 0.95$.
Proof: $$ \begin{align*} P\{ \lambda \geq \lambda^\star \} &\geq P\{ \lambda \geq \lambda^\star \; | \; \beta \in S_\beta(\alpha), \sigma^2 \in S_{\sigma^2}(\alpha) \} \\ &\phantom{\geq} \times P\{ \beta \in S_\beta(\alpha), \sigma^2 \in S_{\sigma^2}(\alpha) \} \\ &= 1 \cdot P\{ \beta \in S_\beta(\alpha) \} \cdot P\{ \sigma^2 \in S_{\sigma^2}(\alpha) \} \\ &= (1 - \alpha) \cdot (1 - \alpha). \end{align*} $$ The first factor equals one because $\lambda^\star$ is the minimum of $\lambda$ over the two regions. The key insight is that the two coverage events are independent: the event $\{ \beta \in S_\beta(\alpha) \}$ depends only on $\hat{\beta}_{\textrm{prev}}$, the event $\{ \sigma^2 \in S_{\sigma^2}(\alpha) \}$ depends only on $\hat{\sigma}_{\textrm{prev}}^2$, and an important result in linear regression is that $\hat{\beta}_{\textrm{prev}}$ is statistically independent of $\hat{\sigma}_{\textrm{prev}}^2$.
Post-Hoc Power Analysis
Sometimes I see people calculate power after performing an A/B test, plugging in the observed effect size and variance. This is bad! Post-hoc power analysis of this sort tends to exaggerate our confidence in the sensitivity of the test. The robust approach outlined above, which replaces point estimates with worst-case values over a confidence region, mitigates this at least partially and could be a viable approach for post-hoc power analysis.
References
[Wood17] Simon N. Wood, “Generalized Additive Models: An Introduction with R”. Chapman & Hall, 2nd edition, 2017.