Design-Based Inference and Sensitivity Analysis for Survey Sampling

1 Introduction
In this note, we consider sampling from a finite population, without replacement and with unequal probabilities. We seek an estimate of the population mean of some characteristic, called the outcome. When individuals are sampled with known first- and second-order inclusion probabilities, Horvitz and Thompson (1952) derived an unbiased estimator of the mean and an unbiased estimate of that estimator's variance.
If the sample is small relative to the population, we can bypass the need for second-order inclusion probabilities by positing a sampling-with-replacement model. Under this model the estimator remains unbiased for the population mean, and the estimated variance is conservative; when the sampling fraction is small, the approximation is a reasonable one (Lohr 2010, § 6.4.1; Durbin 1953).
These derivations are “design-based” in the sense that focus is centered on the sampling mechanism. In contrast, model-based analyses posit a model for the outcome, often by reference to a super-population from which the finite population is drawn (Cassel, Särndal, and Wretman 1977). Design-based methods are valuable when we’re in control of the sampling mechanism (and thus understand it perfectly), because they make no assumption about the outcome. Model-assisted techniques leverage both design-based and model-based approaches (Särndal, Swensson, and Wretman 2003).
In many situations, we do not control the sampling mechanism but rather have a “convenience sample”. When we have access to a set of covariates for all individuals in the population, we can fit a model that estimates the probability that any individual is included in the sample. If this model consistently estimates the sampling propensities, the Horvitz-Thompson estimator is asymptotically unbiased, and in large samples we may still estimate its variance.
With medium sample sizes, we can assess the sensitivity of the estimate to small errors in the estimated sampling propensities. We do this by positing the true sampling propensities lie in a bounded interval centered around the estimated propensity and then solving an optimization problem (Fogarty 2023; Zhao, Small, and Bhattacharya 2019). This gives a range of point estimates that capture the uncertainty associated with the estimated propensity scores. We can incorporate sampling variability, expanding the Horvitz-Thompson confidence intervals and adjusting p-values upward to account for the additional uncertainty from estimated propensity scores.
Sensitivity analysis can also be used to identify estimators that are more robust to errors in the estimated sampling propensities. We show that a “doubly-robust” estimator not only increases robustness, but also decreases variance.
When we have multiple estimates of the same underlying population mean, these estimates may contradict one another and foster confusion and distrust. Multiple samples conducted by different mechanisms provide information about the adequacy of adjustment methods. Armed with this knowledge, a meta-analysis can combine the samples into a single estimate that acknowledges the associated uncertainty.
2 Sampling with unequal probabilities
The population consists of \( N \) individuals with some outcome \( y. \) The population average is \( \bar{y} := \frac{1}{N} \sum_{i=1}^N y_i. \) The sample consists of \( n \) individuals, drawn without replacement from the population with known but unequal probabilities \( \pi_i. \) Let \( Z_i \) be 1 if individual \( i \) is included in the sample and 0 otherwise, so \( \mathrm{Pr}\{ Z_i = 1 \} = \pi_i. \)
Consider estimators of the form: \[ \hat{Y}_\textrm{HT} = \frac{1}{n} \sum_{i \in \mathcal{R}} w_i \cdot y_i, \] where \( \mathcal{R} \) denotes the sampled indices and \( w_i \) is a weight applied to individual \( i. \)
Horvitz and Thompson (1952) showed that if \( w_i = \frac{n/N}{\pi_i}, \) then \( \mathbf{E}\left[\hat{Y}_\textrm{HT} \right] = \bar{y}. \) To see this, write \( \hat{Y}_\textrm{HT} \) as: \[ \hat{Y}_\textrm{HT} = \frac{1}{n} \sum_{i = 1}^N Z_i \cdot w_i \cdot y_i. \] Then
\begin{align*} \mathbf{E}\left[\hat{Y}_\textrm{HT}\right] &= \frac{1}{n} \sum_{i = 1}^N \mathbf{E}[Z_i] \cdot w_i \cdot y_i \\ &= \frac{1}{n} \sum_{i = 1}^N \pi_i \cdot w_i \cdot y_i \\ &= \frac{1}{N} \sum_{i = 1}^N y_i \\ &= \bar{y}. \end{align*}
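As a concrete sketch of the estimator (in Python, with illustrative data and function names not taken from the text):

```python
import numpy as np

def horvitz_thompson(y_sample, pi_sample, n, N):
    """Horvitz-Thompson estimate of the population mean:
    (1/n) * sum_i w_i * y_i with w_i = (n/N) / pi_i."""
    w = (n / N) / pi_sample
    return np.sum(w * y_sample) / n

# Sanity check: with equal inclusion probabilities pi_i = n/N,
# every weight is 1 and the estimate reduces to the sample mean.
y_s = np.array([3.0, 5.0, 7.0])
est = horvitz_thompson(y_s, np.full(3, 3 / 12), n=3, N=12)
```

A simulation that repeatedly draws samples and averages the estimates would show the mean of the estimates converging to \( \bar{y}, \) mirroring the unbiasedness derivation above.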
Horvitz and Thompson (1952) also derived the variance of this estimator, and an unbiased estimate of this variance, but their derivation requires the “second-order inclusion probabilities,” \( \pi_{ij} := \mathrm{Pr}\{ Z_i = 1 \textrm{ and } Z_j = 1 \}, \) which are often unavailable.
Durbin (1953) proposed instead pretending the sample was drawn with replacement from the population. When the sample is small relative to the population, this is a reasonable approximation because sampling with replacement from a large population is unlikely to draw the same individual twice in a small sample.
The Appendix derives the variance of \( \hat{Y}_\textrm{HT} \) under a sampling-with-replacement model: \[ \mathrm{Var}\left(\hat{Y}_\textrm{HT}\right) = \frac{1}{n \cdot N} \cdot \sum_{i=1}^N \frac{1}{w_i} \left( w_i \cdot y_i - \bar{y} \right)^2. \] The Appendix also derives an unbiased estimate of this variance suitable for practical application: \[ \widehat{\mathrm{Var}}\left(\hat{Y}_\textrm{HT}\right) = \frac{1}{n \cdot (n-1)} \cdot \sum_{i \in \mathcal{R}} \left( w_i \cdot y_i - \hat{Y}_\textrm{HT} \right)^2. \]
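The variance estimate above is straightforward to compute; a minimal sketch (illustrative code, not from the text):

```python
import numpy as np

def ht_variance_estimate(y_sample, w):
    """Estimated variance of the HT mean estimator under the
    sampling-with-replacement approximation."""
    n = len(y_sample)
    y_ht = np.sum(w * y_sample) / n
    return np.sum((w * y_sample - y_ht) ** 2) / (n * (n - 1))

# With unit weights this reduces to the familiar variance of a sample
# mean: the sample variance of y divided by n.
v = ht_variance_estimate(np.array([1.0, 2.0, 3.0]), np.ones(3))
```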
Suppose we wanted to test the one-sided null hypothesis \( H_0: \bar{y} = \bar{y}_0 \) vs the alternative \( \bar{y} > \bar{y}_0. \) Under the null hypothesis, \[ t := \frac{\hat{Y}_\textrm{HT} - \bar{y}_0}{\sqrt{\widehat{\mathrm{Var}}\left(\hat{Y}_\textrm{HT}\right)}} \] has an asymptotic standard normal distribution, so \( p_> = 1 - \Phi(t) \) is a one-sided p-value, where \( \Phi \) is the cumulative distribution function for a standard normal. A one-sided \( 100 (1 - \alpha)\% \) confidence interval is the set \( \{ \bar{y}_0 \, : \, p_> \geq \alpha \}. \) This is equivalent to the set \( \{ \bar{y}_0 \, : \, t \leq \Phi^{-1}(1 - \alpha) \} \) or \( \bar{y}_0 \geq \hat{Y}_\textrm{HT} - \Phi^{-1}(1 - \alpha) \cdot \sqrt{\widehat{\mathrm{Var}}\left( \hat{Y}_\textrm{HT} \right)}. \)
Similarly, to test the one-sided null hypothesis \( H_0: \bar{y} = \bar{y}_0 \) vs the alternative \( \bar{y} < \bar{y}_0, \) we would calculate a p-value \( p_< = \Phi(t) \) and an upper bound on a one-sided interval as \( \hat{Y}_\textrm{HT} - \Phi^{-1}(\alpha) \cdot \sqrt{\widehat{\mathrm{Var}}\left( \hat{Y}_\textrm{HT} \right)} \) or the more recognizable \( \hat{Y}_\textrm{HT} + \Phi^{-1}(1 - \alpha) \cdot \sqrt{\widehat{\mathrm{Var}}\left( \hat{Y}_\textrm{HT} \right)}. \)
To calculate a two-sided p-value, double the smaller one-sided p-value: \( p_{<>} = 2 \cdot \mathrm{min}(p_>, p_<). \) For a two-sided \( 100 (1 - \alpha)\% \) confidence interval, take the intersection of the two, one-sided \( 100 (1 - \alpha/2)\% \) intervals, with endpoints \( \hat{Y}_\textrm{HT} \pm \Phi^{-1}(1 - \alpha/2) \cdot \sqrt{\widehat{\mathrm{Var}}\left(\hat{Y}_\textrm{HT} \right)}. \)
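The tests and intervals above can be assembled as follows, a sketch using Python's standard library for \( \Phi \) and \( \Phi^{-1} \):

```python
from statistics import NormalDist

def ht_inference(y_ht, var_hat, ybar_0, alpha=0.05):
    """Two-sided p-value and confidence interval for H0: ybar = ybar_0."""
    nd = NormalDist()
    se = var_hat ** 0.5
    t = (y_ht - ybar_0) / se
    p_gt = 1 - nd.cdf(t)          # evidence for the alternative ybar > ybar_0
    p_lt = nd.cdf(t)              # evidence for the alternative ybar < ybar_0
    p_two = 2 * min(p_gt, p_lt)   # two-sided: double the smaller one-sided
    z = nd.inv_cdf(1 - alpha / 2)
    return p_two, (y_ht - z * se, y_ht + z * se)

# When the estimate equals the null value, the two-sided p-value is 1.
p_two, ci = ht_inference(y_ht=10.0, var_hat=4.0, ybar_0=10.0)
```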
In some instances, the total population size, \( N, \) is unknown. Särndal, Swensson, and Wretman (2003, § 5.7) discuss estimating the population mean as \[ \hat{Y}_\textrm{WA} = \frac{\sum_{i \in \mathcal{R}} w_i \cdot y_i }{\sum_{i \in \mathcal{R}} w_i}, \] where \( w_i = 1 / \pi_i. \) When the weights are scaled such that \( w_i \propto 1/\pi_i \) and \( \sum_{i \in \mathcal{R}} w_i = n, \) \( \hat{Y}_\textrm{HT} \) is a true weighted average of the outcomes, and \( \hat{Y}_\textrm{HT} = \hat{Y}_\textrm{WA}. \)
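To illustrate the equality claimed above: rescaling the weights to sum to \( n \) makes the HT form and the weighted-average form agree (a sketch with hypothetical numbers):

```python
import numpy as np

pi = np.array([0.1, 0.2, 0.4])    # hypothetical inclusion probabilities
y = np.array([5.0, 6.0, 7.0])     # outcomes for the sampled individuals
n = len(y)

w = 1.0 / pi
y_wa = np.sum(w * y) / np.sum(w)  # weighted average; N is not needed

w_scaled = w * n / np.sum(w)      # rescale so that sum(w_scaled) == n
y_ht = np.sum(w_scaled * y) / n   # HT form with the rescaled weights
```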
Särndal, Swensson, and Wretman (2003) provide reasons to prefer \( \hat{Y}_\textrm{WA} \) even when the population size, \( N, \) is known, most compellingly its stability when the outcome is correlated with the response propensities. In that case, setting \( w_i = (n/N) / \pi_i \) clearly results in the same \( \hat{Y}_\textrm{WA} \) and fosters alignment with \( \hat{Y}_\textrm{HT}. \) The variance they provide is based on second-order inclusion probabilities, but with large sample sizes I expect it to match the formulae used above. We cite this work to justify why we might prefer a true weighted average.
3 Sensitivity analysis
Often we do not know the true sampling propensities, \( \pi_i, \) but instead have estimates, \( \hat{\pi}_i. \) Typically these estimates come from fitting a model to the sampled individuals, but then the model itself depends on the sampling mechanism and thus affects the variance calculation. With large samples and a rich set of covariates, we may feel comfortable assuming \( \hat{\pi} \approx \pi, \) with negligible uncertainty, in which case we can use the methods of the last section.
A sensitivity analysis instead assumes the true sampling propensities are close to the estimated scores. Fogarty (2023) uses the model: \[ \frac{1}{\Gamma} \leq \frac{\pi_i / (1 - \pi_i)}{\hat{\pi}_i / (1 - \hat{\pi}_i)} \leq \Gamma \] for all \( i, \) where \( \Gamma \geq 1 \) is the sensitivity parameter. Call the set of sampling propensities, \( \{ \pi_i \}, \) satisfying these bounds a \( \Gamma \)-region.
When \( \Gamma = 1, \) the model assumes that \( \pi_i = \hat{\pi}_i \) for all \( i. \) As we consider larger and larger values of \( \Gamma, \) we entertain less and less reliable estimates. Typically we would not know how accurate our estimates are, and thus cannot choose a sensible value for \( \Gamma, \) but not performing sensitivity analysis is equivalent to assuming \( \Gamma = 1, \) which is the least conservative assumption possible. Instead we can explore a variety of values of \( \Gamma \) to investigate the impact of estimation errors on study conclusions.
Taking logarithms, the model is equivalent to: \[ -\gamma \leq \mathrm{logit}(\pi_i) - \mathrm{logit}(\hat{\pi}_i) \leq \gamma, \] where \( \gamma = \log{\Gamma}, \) or \[ \mathrm{logit}(\pi_i) = \mathrm{logit}(\hat{\pi}_i) + \tau_i, \] where \( |\tau_i| \leq \gamma. \) Using the logistic scale means an error of \( \tau \) has the same interpretation regardless of \( \pi_i. \) In contrast, an absolute error of 0.01 has a different interpretation against \( \pi_i = 0.01 \) (big error) versus \( \pi_i = 0.5 \) (negligible error).
Letting \( \hat{w}_i = \frac{n/N}{\hat{\pi}_i} \) and \( w_i = \frac{n/N}{\pi_i}, \) the bounds on \( \pi_i \) translate into bounds on \( w_i: \) \[ \frac{1}{\Gamma} \hat{w}_i + \frac{n}{N} \left( 1 - \frac{1}{\Gamma} \right) \leq w_i \leq \Gamma \hat{w}_i - \frac{n}{N} \left( \Gamma - 1 \right). \] Through an abuse of terminology, we also use “\( \Gamma \)-region” to refer to the set of weights satisfying these bounds.
If we knew the “correct” weights, we could use them to calculate \( \hat{Y}_\textrm{HT}. \) Instead we calculate bounds on the point estimate as extrema over the \( \Gamma \)-region:
\begin{align*} \hat{Y}_\textrm{HT}^\textrm{min} &= \inf_{w^\ell \preceq w \preceq w^u} \left\{ (1/n) y^T w \right\} \\ \hat{Y}_\textrm{HT}^\textrm{max} &= \sup_{w^\ell \preceq w \preceq w^u} \left\{ (1/n) y^T w \right\}, \end{align*}
where
\begin{align*} w_i^\ell &= \frac{1}{\Gamma} \hat{w}_i + \frac{n}{N} \left( 1 - \frac{1}{\Gamma} \right) \\ w_i^u &= \Gamma \hat{w}_i - \frac{n}{N} \left( \Gamma - 1 \right). \end{align*}
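The weight bounds can be computed directly and checked against the odds-ratio form of the model (illustrative Python; the numbers are hypothetical):

```python
import numpy as np

def weight_bounds(pi_hat, gamma, n, N):
    """Lower and upper bounds (w_lo, w_hi) of the Gamma-region on weights."""
    w_hat = (n / N) / pi_hat
    w_lo = w_hat / gamma + (n / N) * (1 - 1 / gamma)
    w_hi = gamma * w_hat - (n / N) * (gamma - 1)
    return w_lo, w_hi

# Check: shifting the odds of pi_hat by a factor of gamma (up or down)
# and recomputing w = (n/N)/pi should recover the bounds exactly.
pi_hat, gamma, n, N = 0.1, 2.0, 20, 1000
odds = pi_hat / (1 - pi_hat)
pi_hi = gamma * odds / (1 + gamma * odds)    # largest plausible pi
pi_lo = (odds / gamma) / (1 + odds / gamma)  # smallest plausible pi
w_lo, w_hi = weight_bounds(pi_hat, gamma, n, N)
```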
Call \( \left( \hat{Y}_\textrm{HT}^\textrm{min}, \hat{Y}_\textrm{HT}^\textrm{max} \right) \) a \( \Gamma \)-level sensitivity interval.
Note that
\begin{align*} w_i^u - w_i^\ell &= \hat{w}_i \cdot \left( \Gamma - \frac{1}{\Gamma} \right) - \frac{n}{N} \cdot \left( \Gamma - \frac{1}{\Gamma} \right) \\ &= \left( \hat{w}_i - \frac{n}{N} \right) \cdot \left( \Gamma - \frac{1}{\Gamma} \right) \\ &= \frac{n}{N} \cdot \left( \Gamma - \frac{1}{\Gamma} \right) \cdot \left( \frac{1 - \hat{\pi}_i}{\hat{\pi}_i} \right). \end{align*}
For any \( \Gamma, \) this quantity is decreasing in \( \hat{\pi}_i, \) suggesting individuals with the lowest \( \hat{\pi}_i \) contribute most to the sensitivity to errors in the estimated sampling propensities.
When the outcome is non-negative, \( \hat{Y}_\textrm{HT}^\textrm{min} = (1/n) y^T w^\ell \) and \( \hat{Y}_\textrm{HT}^\textrm{max} = (1/n) y^T w^u. \) More generally, \( \hat{Y}_\textrm{HT}^\textrm{min} = (1/n) y^T w^\textrm{min} \) and \( \hat{Y}_\textrm{HT}^\textrm{max} = (1/n) y^T w^\textrm{max}, \) where \( w_i^\textrm{min} = w_i^\ell \) if \( y_i \geq 0 \) and \( w_i^u \) otherwise, and \( w_i^\textrm{max} = w_i^u \) if \( y_i \geq 0 \) and \( w_i^\ell \) otherwise. The width of the sensitivity interval is:
\begin{align*} \hat{Y}_\textrm{HT}^\textrm{max} - \hat{Y}_\textrm{HT}^\textrm{min} &= \frac{1}{n} \sum_{i \in \mathcal{R}} (w_i^u - w_i^\ell) \cdot | y_i | \\ &= \left( \Gamma - \frac{1}{\Gamma} \right) \cdot \frac{1}{N} \cdot \sum_{i \in \mathcal{R}} \left( \frac{1 - \hat{\pi}_i}{\hat{\pi}_i} \right) \cdot | y_i |, \end{align*}
showing the sensitivity of the estimator to errors in the estimated sampling propensities increases approximately linearly with \( \Gamma \) and is influenced most strongly by those individuals with large values for \( | y_i | \) and/or small estimated sampling propensities. The width of the sensitivity interval does not directly depend on the sample size, \( n. \) A larger sample size decreases variance, but does not decrease sensitivity to errors in the estimated sampling propensities.
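The closed-form sensitivity interval, handling outcomes of either sign, can be sketched as:

```python
import numpy as np

def sensitivity_interval(y, w_lo, w_hi):
    """Gamma-level sensitivity interval for the HT point estimate:
    push each weight to the bound that extremizes w_i * y_i."""
    n = len(y)
    w_min = np.where(y >= 0, w_lo, w_hi)  # minimizing weights
    w_max = np.where(y >= 0, w_hi, w_lo)  # maximizing weights
    return y @ w_min / n, y @ w_max / n

# A tiny example with outcomes of mixed sign and bounds [1, 3]:
lo, hi = sensitivity_interval(np.array([1.0, -1.0]),
                              np.array([1.0, 1.0]),
                              np.array([3.0, 3.0]))
```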
Trimming the weights is often used to decrease the variance of weighting estimators, yet if we believe an individual was genuinely unlikely to have been sampled, trimming only makes the estimated weights less accurate, requiring we use a larger \( \Gamma. \) Consider instead discarding individuals with the lowest estimated sampling propensities. This directly reduces the width of the sensitivity interval, but also has bias and variance implications. If the sampling propensities are correlated with the outcome, discarding individuals may increase the bias. By decreasing the sample size, this may increase the variance (by a negligible amount in large samples), but also decreases the variance by removing large weights from the estimator. Thus, the choice to discard individuals involves a three-way tradeoff between bias, variance, and sensitivity to errors in the estimated sampling propensities.
While removing individuals with large outcomes raises concerns of p-hacking, removing individuals with small estimated sampling propensities is a valid component of the design stage of the study, because it can be done before examining outcomes (Rubin 2008).
Fogarty (2023) calculates
\begin{align*} \hat{Y}_\textrm{WA}^\textrm{min} &= \inf_{w^\ell \preceq w \preceq w^u} \{ y^T w / 1^T w \} \\ \hat{Y}_\textrm{WA}^\textrm{max} &= \sup_{w^\ell \preceq w \preceq w^u} \{ y^T w / 1^T w \} \end{align*}
which correspond to \( \hat{Y}_\textrm{WA}. \) There is no simple closed-form expression for these quantities, but they can be computed efficiently by solving a linear-fractional program, which in turn is equivalent to a linear program (Boyd and Vandenberghe 2004, § 4.3.2).
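A sketch of that reduction via the Charnes-Cooper transformation, assuming SciPy is available for the linear program (the formulation follows Boyd and Vandenberghe § 4.3.2; variable names are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def wa_extremum(y, w_lo, w_hi, which="min"):
    """Extremum of y^T w / 1^T w over w_lo <= w <= w_hi.
    Charnes-Cooper: with v = t*w and t = 1/(1^T w), minimize y^T v
    subject to 1^T v = 1, w_lo * t <= v <= w_hi * t, t >= 0."""
    y = np.asarray(y, dtype=float)
    m = len(y)
    sign = 1.0 if which == "min" else -1.0
    c = np.append(sign * y, 0.0)                      # variables: (v, t)
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)  # sum(v) = 1
    A_ub = np.block([
        [np.eye(m), -np.reshape(w_hi, (-1, 1))],      # v <= w_hi * t
        [-np.eye(m), np.reshape(w_lo, (-1, 1))],      # v >= w_lo * t
    ])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * m),
                  A_eq=A_eq, b_eq=[1.0])
    return sign * res.fun

# Minimizing (0*w1 + 10*w2) / (w1 + w2) over the box [1, 3]^2 pushes
# w2 to its lower bound and w1 to its upper bound: 10 / 4 = 2.5.
low = wa_extremum([0.0, 10.0], [1.0, 1.0], [3.0, 3.0], "min")
```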
When the weights are estimated from the sample, it may seem desirable to incorporate that additional source of uncertainty, perhaps using the bootstrap. However, the true sampling propensities are not stochastic, and they are contained in some \( \Gamma \)-region. Since we are not claiming any particular \( \Gamma \) is adequate, there is no need to take extra steps to account for data-driven weights.
The sensitivity interval accounts for errors in the estimated sampling propensities, but does not account for the stochastic sampling mechanism as do p-values and confidence intervals. Yet incorporating uncertainty in the sampling propensities into these quantities is not as straightforward as for point estimates. For example, recall that \( p_> \) is a one-sided p-value providing a measure of evidence (in the typical Frequentist sense) in favor of the alternative \( \bar{y} > \bar{y}_0. \) Assuming the true sampling propensities lie in the \( \Gamma \)-region, the p-value is bounded above by \( p_>^u = \sup_{w^\ell \preceq w \preceq w^u} \{ p_> \}, \) which corresponds to:
\[ \begin{array}{ll} \mbox{minimize} & \left(\hat{Y}_\textrm{HT} - \bar{y}_0 \right) / \sqrt{\widehat{\mathrm{Var}}\left(\hat{Y}_\textrm{HT}\right)} \\ \mbox{subject to} & w^\ell \preceq w \preceq w^u. \end{array} \]
Unfortunately, \( t \) is not convex, nor quasiconvex (it’s affine over the square root of a quadratic), and thus it is not obvious how to calculate \( p_>^u \) efficiently. On the other hand, we may simply use \( \hat{Y}_\textrm{HT}^\textrm{min} \) and the corresponding weights to calculate a test statistic and compute the corresponding p-value. This calculates: \[ t_\textrm{min} = \frac{\hat{Y}_\textrm{HT}^\textrm{min} - \bar{y}_0}{\sqrt{\widehat{\mathrm{Var}}_\textrm{min}}}, \] where \[ \widehat{\mathrm{Var}}_\textrm{min} = \frac{1}{n\cdot(n-1)} \cdot \sum_{i \in \mathcal{R}} \left(w_i^\textrm{min} \cdot y_i - \hat{Y}_\textrm{HT}^\textrm{min}\right)^2 \] and then approximates \( p_>^u \) as \( 1 - \Phi(t_\textrm{min}). \) This is a valid p-value for a possible set of weights, and is likely to be close to the intractable \( p_>^u \) if the dominant variation in \( t \) is due to the numerator.
Similarly, we may approximate \( p_<^u \) as \( \Phi(t_\textrm{max}), \) where \[ t_\textrm{max} = \frac{\hat{Y}_\textrm{HT}^\textrm{max} - \bar{y}_0}{\sqrt{\widehat{\mathrm{Var}}_\textrm{max}}} \] and \[ \widehat{\mathrm{Var}}_\textrm{max} = \frac{1}{n\cdot(n-1)} \cdot \sum_{i \in \mathcal{R}} \left(w_i^\textrm{max} \cdot y_i - \hat{Y}_\textrm{HT}^\textrm{max}\right)^2. \] We approximate \( p_{<>}^u \) as \( 2 \cdot \min\left( 1 - \Phi(t_\textrm{min}), \Phi(t_\textrm{max}) \right). \) An expanded two-sided confidence interval may be calculated as \[ \left[ \hat{Y}_\textrm{HT}^\textrm{min} - \Phi^{-1}(1 - \alpha/2) \cdot \sqrt{ \widehat{\mathrm{Var}}_\textrm{min}}, \hat{Y}_\textrm{HT}^\textrm{max} + \Phi^{-1}(1 - \alpha/2) \cdot \sqrt{ \widehat{\mathrm{Var}}_\textrm{max}} \right]. \] Again, these are approximations, but likely useful ones.
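A sketch of the expanded two-sided interval, reusing the sign-dependent extremal weights (illustrative code):

```python
import numpy as np
from statistics import NormalDist

def expanded_interval(y, w_lo, w_hi, alpha=0.05):
    """Two-sided confidence interval expanded over a Gamma-region."""
    n = len(y)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ends = []
    for w in (np.where(y >= 0, w_lo, w_hi),   # minimizing weights
              np.where(y >= 0, w_hi, w_lo)):  # maximizing weights
        est = y @ w / n
        var = np.sum((w * y - est) ** 2) / (n * (n - 1))
        ends.append((est, var))
    (e_min, v_min), (e_max, v_max) = ends
    return e_min - z * v_min ** 0.5, e_max + z * v_max ** 0.5

# With w_lo == w_hi (i.e. Gamma = 1) this reduces to the ordinary CI.
lo, hi = expanded_interval(np.array([1.0, 2.0, 3.0]), np.ones(3), np.ones(3))
```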
In contrast, Zhao, Small, and Bhattacharya (2019) use bootstrap resampling to combine the uncertainty of the sampling propensities with that of the stochastic sampling mechanism. They create a sequence of synthetic datasets by resampling the original dataset (including the weights selected for each individual) with replacement. In dataset \( b, \) they calculate \( \hat{Y}_\textrm{HT}^{\textrm{min}, (b)} \) and \( \hat{Y}_\textrm{HT}^{\textrm{max}, (b)}. \) Then they calculate an expanded confidence interval as a lower percentile of \( \hat{Y}_\textrm{HT}^{\textrm{min}, (b)} \) and an upper percentile of \( \hat{Y}_\textrm{HT}^{\textrm{max}, (b)}. \) This effectively uses resampling to incorporate the sampling-based uncertainty in \( \hat{Y}_\textrm{HT}, \) and is a model-based approach.
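A sketch of that percentile-bootstrap interval (function and parameter names are illustrative, not from the cited paper):

```python
import numpy as np

def bootstrap_expanded_interval(y, w_lo, w_hi, alpha=0.05, B=2000, seed=0):
    """Resample individuals (with their weight bounds) with replacement;
    take a lower percentile of the per-replicate minima and an upper
    percentile of the per-replicate maxima."""
    rng = np.random.default_rng(seed)
    n = len(y)
    mins = np.empty(B)
    maxs = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # resample with replacement
        yb, lob, hib = y[idx], w_lo[idx], w_hi[idx]
        mins[b] = yb @ np.where(yb >= 0, lob, hib) / n
        maxs[b] = yb @ np.where(yb >= 0, hib, lob) / n
    return np.quantile(mins, alpha / 2), np.quantile(maxs, 1 - alpha / 2)

# Degenerate check: constant outcomes and unit weights give a
# zero-width interval at the common value.
lo, hi = bootstrap_expanded_interval(np.full(5, 2.0), np.ones(5), np.ones(5))
```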
4 A more robust estimator
Suppose we have a model that predicts the outcome, \( y, \) using observed covariates, \( \mu(x). \) This model may have been trained on previous survey waves, but we assume the samples being analyzed were not used to train the model. Thus, from a design-based perspective where the only stochastic element is the sampling mechanism, the predictions are fixed, not random quantities. We do not require the model be “correct” in any sense.
Särndal, Swensson, and Wretman (2003, § 6.3) discuss the difference estimator:
\begin{align*} \hat{Y}_\textrm{diff} &:= \frac{1}{n} \sum_{i \in \mathcal{R}} w_i \cdot (y_i - \mu(x_i)) + \frac{1}{N} \sum_{i=1}^N \mu(x_i) \\ &= \bar{\mu} + \frac{1}{n} \sum_{i \in \mathcal{R}} w_i \cdot (y_i - \mu(x_i)). \end{align*}
This estimator uses \( \mu(x) \) to predict the average outcome in the population, and then adjusts this prediction with a weighted sum of residuals.
We can demonstrate this estimator is still unbiased by writing it as \[ \hat{Y}_\textrm{diff} = \left(\frac{1}{n} \sum_{i \in \mathcal{R}} w_i \cdot y_i \right) + \left(\bar{\mu} - \frac{1}{n} \sum_{i \in \mathcal{R}} w_i \cdot \mu(x_i) \right). \] When \( w_i = \frac{n/N}{\pi_i}, \) the first term is unbiased for \( \bar{y} \) and the last term is unbiased for \( \bar{\mu}, \) so the terms in the second parentheses cancel.
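A minimal sketch of the difference estimator (names are illustrative):

```python
import numpy as np

def difference_estimate(y_sample, mu_sample, w, mu_bar):
    """mu_bar is the model's average prediction over the WHOLE population;
    the weighted sum of residuals corrects it using the sample."""
    n = len(y_sample)
    return mu_bar + np.sum(w * (y_sample - mu_sample)) / n

# Two sanity checks: a model that is exact on the sample returns mu_bar
# unchanged, and a useless model (mu = 0) returns the HT estimate.
y_s = np.array([4.0, 6.0])
perfect = difference_estimate(y_s, y_s, np.array([0.5, 2.0]), mu_bar=5.5)
useless = difference_estimate(y_s, np.zeros(2), np.ones(2), mu_bar=0.0)
```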
Now take a super-population approach, so that \( (X, Y) \) have some joint distribution in the super-population, and the population is a random IID sample from this super-population. Suppose \( \mu(x) = \mathbf{E}[ Y \, | \, X = x ]. \) Then treating the covariates and outcomes as the stochastic elements rather than the sampling mechanism, write \[ \hat{Y}_\textrm{diff} = \frac{1}{n} \sum_{i \in \mathcal{R}} w_i \cdot (Y_i - \mu(X_i)) + \frac{1}{N} \sum_{i=1}^N \mu(X_i) \] to reflect the stochastic nature of \( Y_i \) and \( X_i. \) Let \( X \) and \( x \) concatenate the \( X_i \) and \( x_i, \) \( i=1,\ldots,N. \) We have:
\begin{align*} \mathbf{E}\left[ \hat{Y}_\textrm{diff} \, \middle| \, X = x \right] &= \frac{1}{n} \sum_{i \in \mathcal{R}} w_i \cdot (\mathbf{E}[ Y_i \, | \, X = x ] - \mathbf{E}[ \mu(X_i) \, | \, X = x ]) \\ &\phantom{=} \hspace{20pt} + \frac{1}{N} \sum_{i=1}^N \mathbf{E}[ \mu(X_i) \, | \, X = x ] \\ &= \frac{1}{n} \sum_{i \in \mathcal{R}} w_i \cdot (\mathbf{E}[ Y_i \, | \, X_i = x_i ] - \mathbf{E}[ \mu(X_i) \, | \, X_i = x_i ]) \\ &\phantom{=} \hspace{20pt} + \frac{1}{N} \sum_{i=1}^N \mathbf{E}[ \mu(X_i) \, | \, X_i = x_i ] \\ &= \frac{1}{n} \sum_{i \in \mathcal{R}} w_i \cdot (\mu(x_i) - \mu(x_i)) + \frac{1}{N} \sum_{i=1}^N \mu(x_i) \\ &= \frac{1}{N} \sum_{i=1}^N \mu(x_i) \\ &= \frac{1}{N} \sum_{i=1}^N \mathbf{E}[ Y_i \, | \, X_i = x_i ] \\ &= \mathbf{E}\left[ \frac{1}{N} \sum_{i=1}^N Y_i \, \middle| \, X = x \right] \\ &= \mathbf{E}\left[ \bar{Y} \, \middle| \, X = x \right]. \end{align*}
Using iterated expectations,
\begin{align*} \mathbf{E}\left[ \hat{Y}_\textrm{diff} \right] &= \mathbf{E} \left[ \mathbf{E}\left[ \hat{Y}_\textrm{diff} \, | \, X \right] \right] \\ &= \mathbf{E} \left[ \mathbf{E}\left[ \bar{Y} \, | \, X \right] \right] \\ &= \mathbf{E}\left[ \bar{Y} \right] \\ &= \bar{y}, \end{align*}
where \( \bar{y} \) now stands in for the super-population mean. Thus, if the weights are correct (in the sense that \( w_i = \frac{n/N}{\pi_i} \)) or the outcome model is correct (in the sense that \( \mu(x) = \mathbf{E}[Y \, | \, X=x] \)), \( \hat{Y}_\textrm{diff} \) is unbiased, a property often called double robustness. We will see \( \hat{Y}_\textrm{diff} \) has desirable properties, even if the outcome model is incorrect.
Consider the variance of \( \hat{Y}_\textrm{diff}: \)
\begin{align*} \mathrm{Var}\left( \hat{Y}_\textrm{diff} \right) &= \frac{1}{n \cdot N} \cdot \sum_{i = 1}^N \frac{1}{w_i} \left( w_i \cdot (y_i - \mu(x_i)) - \left(\bar{y} - \bar{\mu}\right) \right)^2 \\ &= \frac{1}{n \cdot N} \cdot \sum_{i = 1}^N \frac{1}{w_i} \left( w_i \cdot \epsilon_i - \bar{\epsilon} \right)^2, \end{align*}
where \( \epsilon_i = y_i - \mu(x_i) \) and \( \bar{\epsilon} \) is the population mean of the residuals. If \( \mu \) has any predictive power for the outcome, we might expect \( \epsilon_i \) to be smaller in magnitude than \( y_i, \) and \( \bar{\epsilon} \) to be smaller in magnitude than \( \bar{y}, \) and so we expect \( \mathrm{Var}\left(\hat{Y}_\textrm{diff}\right) \) to be smaller than \( \mathrm{Var}\left(\hat{Y}_\textrm{HT}\right). \) Thus, even when the outcome model is not correct (so we do not achieve the double robustness property), it can still reduce the variance compared to the Horvitz-Thompson estimator. An unbiased estimate of the variance of the difference estimator is:
\[ \widehat{\mathrm{Var}}\left( \hat{Y}_\textrm{diff} \right) = \frac{1}{n\cdot(n-1)} \cdot \sum_{i \in \mathcal{R}} \left( w_i \cdot (y_i - \mu(x_i)) - \left(\hat{Y}_\textrm{diff} - \bar{\mu}\right) \right)^2. \]
Now consider sensitivity analysis for this estimator. The bounds on the point estimate become:
\begin{align*} \hat{Y}_\textrm{diff}^\textrm{min} &= \bar{\mu} + \frac{1}{n} \epsilon^T w^\textrm{min}, \\ \hat{Y}_\textrm{diff}^\textrm{max} &= \bar{\mu} + \frac{1}{n} \epsilon^T w^\textrm{max}. \end{align*}
The width of this interval is \[ \hat{Y}_\textrm{diff}^\textrm{max} - \hat{Y}_\textrm{diff}^\textrm{min} = \left( \Gamma - \frac{1}{\Gamma} \right) \cdot \frac{1}{N} \cdot \sum_{i \in \mathcal{R}} \left( \frac{1 - \hat{\pi}_i}{\hat{\pi}_i} \right) \cdot | \epsilon_i |, \] compared to \[ \hat{Y}_\textrm{HT}^\textrm{max} - \hat{Y}_\textrm{HT}^\textrm{min} = \left( \Gamma - \frac{1}{\Gamma} \right) \cdot \frac{1}{N} \cdot \sum_{i \in \mathcal{R}} \left( \frac{1 - \hat{\pi}_i}{\hat{\pi}_i} \right) \cdot | y_i | \] for the Horvitz-Thompson estimator. Clearly, if the residuals \( \epsilon_i \) tend to be smaller in magnitude than the outcomes, the sensitivity interval will be narrower and the results will be less sensitive to errors in the estimated sampling propensities.
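A quick numeric comparison of the two widths (hypothetical numbers; the same formula serves both estimators, applied to outcomes or to residuals):

```python
import numpy as np

def interval_width(values, pi_hat, gamma, N):
    """Width of the Gamma-level sensitivity interval. Pass outcomes y
    for the HT estimator, or residuals y - mu for the difference
    estimator."""
    return (gamma - 1 / gamma) / N * np.sum((1 - pi_hat) / pi_hat
                                            * np.abs(values))

pi_hat = np.array([0.05, 0.10, 0.20])
y = np.array([50.0, 60.0, 70.0])
mu = np.array([48.0, 61.0, 67.0])  # predictions with small residuals
w_ht = interval_width(y, pi_hat, gamma=1.5, N=1000)
w_diff = interval_width(y - mu, pi_hat, gamma=1.5, N=1000)
```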
The key insight here is that errors in the estimated sampling propensities do not affect \( \mu \), yet we do not require \( \mu \) to be “correct” in any sense for \( \hat{Y}_\textrm{diff} \) to be unbiased. Assuming the true sampling propensities lie in a \( \Gamma \)-region around \( \hat{\pi}, \) an unbiased estimator corresponds to some set of sampling propensities in that region, and that estimator is bounded by the sensitivity interval. For any \( \Gamma, \) the smaller magnitude of \( \epsilon \) compared to \( y \) leads us to expect tighter bounds for the difference estimator than for Horvitz-Thompson. Even though we don’t know how accurate our model for the sampling propensities is, sensitivity analysis helps us identify a superior test statistic.
We might consider a weighted-average version of a difference estimator, say: \[ \hat{Y}_\textrm{diff, WA} = \bar{\mu} + \frac{\sum_{i \in \mathcal{R}} w_i \cdot (y_i - \mu(x_i))}{\sum_{i \in \mathcal{R}} w_i}. \] Approximate unbiasedness follows from the approximate unbiasedness of \( \hat{Y}_\textrm{WA} \) (Särndal, Swensson, and Wretman 2003, Result 5.7.1). Sensitivity analysis for such an estimator follows Fogarty (2023).
5 Lower bounds on sensitivity parameters
We may have not just one sample, \( \{ y_i \}_{i \in \mathcal{R}}, \) but rather \( J \geq 2 \) samples drawn independently from the same population, \( \{ \{ y_i^{(j)} \}_{i \in \mathcal{R}_j} \}_{j=1}^J. \) Suppose the \( j \)th sample involves \( n_j \) individuals. We will assume that \( n_j \) is large enough to justify normal approximations, and that \( N \) is large enough relative to \( n_j \) to justify the simple sampling-with-replacement approximations we’ve made.
If the sampling mechanisms differ, the \( J \) samples may differ in terms of observed and unobserved covariates. Even after adjusting for observed covariates, the unobserved differences remain, and different samples may provide different estimates of the population mean. Specifically, let \( \hat{Y}^{(j)} \) be an estimate of the population mean based on sample \( j. \) There is no obligation to use the same method for all samples. We might use the Horvitz-Thompson estimator for some samples and the difference estimator for others.
Even if we have adjusted for all relevant covariates, so that each \( \hat{Y}^{(j)} \) is an unbiased estimate of the population mean, we would not expect the estimates to be identical. Sampling variation is expected. Yet greater-than-expected differences would cast suspicion on whether our adjustments were adequate.
For example, suppose \( J = 2. \) If the two estimates are both unbiased for the population mean, then \( \hat{Y}^{(2)} - \hat{Y}^{(1)} \) is zero mean, and has variance equal to the sum of the variances of the two estimators. Then the two-sample \( z \)-statistic \[ t_2 = \frac{\hat{Y}^{(2)} - \hat{Y}^{(1)}}{\sqrt{\widehat{\mathrm{Var}}(\hat{Y}^{(1)}) + \widehat{\mathrm{Var}}(\hat{Y}^{(2)}) }} \] has an asymptotic standard normal distribution. If \( p_{<>} := 2 \cdot \min(\Phi(t_2), 1 - \Phi(t_2)) \) were lower than some agreed-upon threshold \( \alpha, \) we would begin looking for explanations for the larger-than-expected discrepancy.
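A sketch of the two-sample check:

```python
from statistics import NormalDist

def discrepancy_pvalue(est1, var1, est2, var2):
    """Two-sided p-value for H0: both estimators share the same mean."""
    t2 = (est2 - est1) / (var1 + var2) ** 0.5
    p = NormalDist().cdf(t2)
    return 2 * min(p, 1 - p)

# Identical estimates give p = 1; a gap of ~2.8 combined standard
# errors gives a small p-value flagging the discrepancy.
p_same = discrepancy_pvalue(10.0, 1.0, 10.0, 1.0)
p_far = discrepancy_pvalue(10.0, 1.0, 14.0, 1.0)
```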
One obvious explanation is that \( \mathbf{E}\left[ \hat{Y}^{(1)} \right] \) does not in fact equal \( \mathbf{E}\left[ \hat{Y}^{(2)} \right], \) indicating that either or both estimators are biased. This in turn provides evidence one or both sets of estimated sampling propensities are in error.
A \( \chi^2 \)-test extends this check to \( J > 2. \) Let \[ \bar{Y} = \frac{\sum_{j=1}^J v_j \cdot \hat{Y}^{(j)}}{\sum_{j=1}^J v_j}, \] where \( v_j = 1 / \widehat{\mathrm{Var}}\left(\hat{Y}^{(j)}\right). \) Then \[ X = \sum_{j=1}^J v_j \cdot \left( \hat{Y}^{(j)} - \bar{Y} \right)^2 \] has an asymptotic \( \chi_{J-1}^2 \) distribution. Sufficiently large values of \( X \) are evidence of residual bias.
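A sketch of the statistic, using inverse-variance weights \( v_j = 1/\widehat{\mathrm{Var}}\left(\hat{Y}^{(j)}\right) \):

```python
import numpy as np

def heterogeneity_statistic(estimates, variances):
    """Cochran-style chi-square statistic (J-1 degrees of freedom
    under the hypothesis that all estimates share one mean)."""
    est = np.asarray(estimates, dtype=float)
    v = 1.0 / np.asarray(variances, dtype=float)
    y_bar = np.sum(v * est) / np.sum(v)
    return np.sum(v * (est - y_bar) ** 2)

x0 = heterogeneity_statistic([5.0, 5.0, 5.0], [1.0, 2.0, 3.0])  # no spread
x1 = heterogeneity_statistic([0.0, 2.0], [1.0, 1.0])            # spread of 2
```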
Suppose we want to check whether it is plausible that \[ \frac{1}{\Gamma^{(j)}} \leq \frac{\pi_i^{(j)} / \left(1 - \pi_i^{(j)}\right)}{\hat{\pi}_i^{(j)} / \left(1 - \hat{\pi}_i^{(j)}\right)} \leq \Gamma^{(j)}, \] for all \( i \) and all \( j, \) and for a particular set \( \{ \Gamma^{(j)} \}, \) with \( \Gamma^{(j)} \geq 1.\) Let \( \vec{\Gamma} \) be the vector with \(j\)th entry \( \Gamma^{(j)}. \) Let \[ X^\star( \vec{\Gamma} ) = \inf_{w_j^\ell \preceq w_j \preceq w_j^u} \{ X \}, \] where \( w_j \) is a set of weights to be used to calculate \( \hat{Y}^{(j)} \) and \(w_j^\ell, \) \( w_j^u \) are the boundaries of a \( \Gamma^{(j)} \)-region.
The test statistic depends on the weights in a complicated way. Assume \( \hat{Y}^{(j)} \) is an affine function of the weights \( w_j, \) as \( \hat{Y}_\textrm{HT} \) and \( \hat{Y}_\textrm{diff} \) are (but \( \hat{Y}_\textrm{WA} \) is not), and hold the estimated variances fixed at their values under \( \hat{w}^{(j)}. \) Then \( X \) is a convex quadratic function of the weights, and calculating \( X^\star \) amounts to solving a quadratic program.
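A sketch of computing \( X^\star \): holding the variances fixed, the objective is a convex quadratic in the stacked weights, so a generic box-constrained solver suffices here (SciPy's L-BFGS-B is assumed as a stand-in for a dedicated QP solver; inverse-variance weights and all names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def x_star(samples, variances, w_lo, w_hi):
    """Minimize the heterogeneity statistic over per-sample weight boxes.
    samples: list of outcome arrays y_j; variances: held fixed;
    w_lo, w_hi: lists of per-sample weight bounds."""
    v = 1.0 / np.asarray(variances, dtype=float)
    splits = np.cumsum([len(y) for y in samples])[:-1]

    def X(w_flat):
        ws = np.split(w_flat, splits)
        est = np.array([y @ w / len(y) for y, w in zip(samples, ws)])
        y_bar = np.sum(v * est) / np.sum(v)
        return np.sum(v * (est - y_bar) ** 2)

    lo = np.concatenate([np.asarray(b, dtype=float) for b in w_lo])
    hi = np.concatenate([np.asarray(b, dtype=float) for b in w_hi])
    res = minimize(X, (lo + hi) / 2, bounds=list(zip(lo, hi)),
                   method="L-BFGS-B")
    return res.fun

# Overlapping Gamma-regions can reconcile the two estimates, so X* ~ 0.
x = x_star([np.array([1.0]), np.array([2.0])], [1.0, 1.0],
           w_lo=[[0.5], [0.5]], w_hi=[[2.0], [2.0]])
```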
If the estimates differ wildly and the purported \( \left\{ \Gamma^{(j)} \right\} \) are small, it is likely that \( X^\star \) will exceed the critical value of the \( \chi^2 \) distribution, \( x_\textrm{crit} := \Phi_{\chi_{J-1}^2}^{-1}(1 - \alpha), \) and we would conclude the biases must exceed those implied by the \( \{ \Gamma^{(j)} \}. \) The set \( G = \left\{ \vec{\Gamma} \, : \, X^\star\left(\vec{\Gamma}\right) < x_\textrm{crit} \right\} \) contains all plausible sets of sensitivity parameters.
If \( \vec{\Gamma}_1 \) and \( \vec{\Gamma}_2 \) are both in \( G, \) and \( \Gamma_1^{(j)} \leq \Gamma_2^{(j)} \) for all \( j, \) with strict inequality holding for at least one \( j, \) say that \( \vec{\Gamma}_1 \) dominates \( \vec{\Gamma}_2. \) Define the Pareto optimal set as the set \( P \subseteq G \) where no point in \( P \) is dominated by any point in \( G. \) The set \( P \) specifies the set of smallest plausible sensitivity parameters. It would be preferable to have an upper bound on the sensitivity parameters, but such a bound does not exist unless the true population mean is known. Characterizing the minimum uncertainty associated with an estimate is a core statistical exercise (Tukey 1986).
6 Combining multiple, contradictory samples
Consider using \( J \) samples to test \( H_0: \bar{y} = \bar{y}_0, \) giving \( J \) p-values, \( p_1, \ldots, p_J. \) These could be one- or two-sided p-values; they could be p-values calculated under the assumption that \( \Gamma^{(j)} = 1 \) for all \( j, \) or upper bounds on p-values calculated for a purported \( \vec{\Gamma}. \)
These p-values may be combined in a meta-analysis to synthesize the evidence against any particular \( \bar{y}_0. \) For example, Fisher’s method calculates \( p_m \) by comparing \[ -2 \cdot \sum_{j=1}^J \log(p_j) \] to a \( \chi_{2J}^2 \) distribution, while the truncated product method takes the product of any p-values below a specified cutoff (Zaykin et al. 2002). The Hodges-Lehmann method may be used to calculate a synthesized point estimate (Hodges and Lehmann 1963).
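A sketch of Fisher's method; because the chi-square distribution with the even degrees of freedom \( 2J \) has a closed-form survival function, only the standard library is needed:

```python
import math

def fisher_combined_pvalue(pvalues):
    """Fisher's method: refer -2 * sum(log p_j) to chi-square with 2J df."""
    J = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # half of the statistic
    # Survival function of chi2_{2J} at x:
    #   exp(-x/2) * sum_{k=0}^{J-1} (x/2)^k / k!
    term, total = 1.0, 0.0
    for k in range(J):
        total += term
        term *= half / (k + 1)
    return math.exp(-half) * total

# With a single p-value, Fisher's method returns it unchanged.
p1 = fisher_combined_pvalue([0.3])
p2 = fisher_combined_pvalue([0.05, 0.05])
```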
Let
\begin{align*} \hat{Y}_\textrm{meta}^\textrm{min}\left(\vec{\Gamma}\right) &= \inf \{ \bar{y}_0 \, : \, p_m \geq \alpha \} \\ \hat{Y}_\textrm{meta}^\textrm{max}\left(\vec{\Gamma}\right) &= \sup \{ \bar{y}_0 \, : \, p_m \geq \alpha \}, \end{align*}
where \( p_m \) implicitly depends on \( \vec{\Gamma}. \) Let
\begin{align*} \hat{Y}_\textrm{meta}^\textrm{min} &= \inf_{\vec{\Gamma} \in P} \left\{ \hat{Y}_\textrm{meta}^\textrm{min}\left(\vec{\Gamma}\right) \right\} \\ \hat{Y}_\textrm{meta}^\textrm{max} &= \sup_{\vec{\Gamma} \in P} \left\{ \hat{Y}_\textrm{meta}^\textrm{max}\left(\vec{\Gamma}\right) \right\}. \end{align*}
These quantities represent extremal bounds on the confidence interval over the minimum uncertainty Pareto optimal set. In a sense, they represent the tightest interval that plausibly contains the population mean, given sampling variability and the unknown pattern of sensitivity parameters.
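To make these definitions concrete, one can invert the combined test numerically: scan candidate values \( \bar{y}_0 \) over a grid, compute \( p_m \) at each, and keep the extremes of the acceptance region. The sketch below fixes \( \Gamma^{(j)} = 1, \) uses two-sided normal-approximation p-values built from hypothetical sample means and standard errors, and combines them with Fisher's method; all function names and numbers are our own.

```python
import math

def two_sided_p(est, se, y0):
    """Two-sided p-value for H0: mean = y0, from a normal approximation."""
    z = abs(est - y0) / se
    return math.erfc(z / math.sqrt(2.0))

def fisher_p(pvalues):
    """Fisher's combined p-value; closed-form chi-squared survival
    function, since 2J degrees of freedom is even."""
    half = -sum(math.log(p) for p in pvalues)
    J = len(pvalues)
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(J))

def meta_interval(ests, ses, alpha=0.05, lo=5.0, hi=15.0, step=0.01):
    """Grid-search the acceptance region {y0 : p_m >= alpha}."""
    grid = (lo + k * step for k in range(int((hi - lo) / step) + 1))
    accepted = [y0 for y0 in grid
                if fisher_p([two_sided_p(e, s, y0)
                             for e, s in zip(ests, ses)]) >= alpha]
    return min(accepted), max(accepted)

# Two somewhat contradictory samples: means 10 and 11, each with SE 0.5.
low, high = meta_interval([10.0, 11.0], [0.5, 0.5])
```

Because the two samples disagree, the resulting interval is wider than either sample's own confidence interval, stretching below 10 and above 11.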
7 Summary
(Rubin 2008) extolled the virtues of design-based methods, which focus on the sampling mechanism rather than the outcome generating process. Design-based methods lend themselves to sensitivity analysis, which reduces the burden of imperfectly understanding the sampling mechanism. Yet we saw that an outcome model can both reduce the variance of the estimate, and improve robustness to errors in the sampling model.
The man with two watches never knows what time it is; yet we saw that multiple, even contradictory, estimates of a population mean are informative about the accuracy of the estimated sampling propensities. Armed with this knowledge, we can combine these estimates in a meta-analysis into an overall estimate of the population mean.
We have reviewed the classic Horvitz-Thompson estimator, a weighted average variant, and an estimator leveraging a model for the outcome. We performed a design-based sensitivity analysis to calculate sensitivity intervals, expanded confidence intervals, and adjusted p-values. We discussed two strategies for reducing sensitivity to errors in estimated sampling propensities: discarding individuals with the smallest estimated sampling propensities (which may increase bias), and incorporating a model for the outcomes. We described how to calculate minimum uncertainty sets for sampling propensity errors, and how to combine multiple samples into an overall estimate.
8 Appendix: Derivations
In this section, we derive the variance of \( \hat{Y}_\textrm{HT} = \frac{1}{n} \sum_{i \in \mathcal{R}} w_i y_i. \) It will turn out that this variance depends on all outcomes, \( y_i, \) even the unobserved ones, so we cannot actually calculate this variance. Luckily, there is a simple variance estimator depending only on observed quantities, and that variance estimator is unbiased.
Above, we showed that \( \hat{Y}_\textrm{HT} \) is unbiased for the population average, \( \bar{y}, \) using sampling indicators \( Z_i. \) Attempting a similar strategy for deriving the variance quickly runs into quantities \( \pi_{ij} = \mathrm{Pr}\{ Z_i = 1 \textrm{ and } Z_j = 1 \}.\) In our applications, it’s hard enough to estimate the first order inclusion probabilities, \( \pi_i = \mathrm{Pr}\{ Z_i = 1 \}, \) let alone the second order inclusion probabilities.
Instead, (Lohr 2010) suggests positing a sampling-with-replacement model, with \( Q_i \) being the number of times unit \( i \) is included in the sample. Of course, in practice \( Q_i \) is either 0 or 1 for all units, but using this notation removes the need for second-order inclusion probabilities.
Let \( z_i \) be the probability individual \( i \) is sampled in a given draw. We have \( \sum_{i=1}^N z_i = 1 \) and \( z_i > 0 \) for all \( i. \) The probability an individual is sampled at least once is \( \pi_i = 1 - (1 - z_i)^n \approx n \cdot z_i \) when \( z_i \) is small. Inverting this relationship, we have \( z_i = 1 - (1 - \pi_i)^{1/n} \approx \pi_i / n \) when \( \pi_i \) is small. Thus, \( Q_i \sim \mathrm{Binom}(n, z_i), \) \( \mathbf{E}[Q_i] = n \cdot z_i \approx \pi_i, \) and \( \mathrm{Var}(Q_i) = n \cdot z_i \cdot (1 - z_i) \approx \pi_i \cdot (1 - \pi_i / n). \)
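The quality of the approximation \( \pi_i \approx n \cdot z_i, \) and the exactness of the inverse relation, are easy to check numerically; the values of \( n \) and \( z_i \) below are arbitrary small numbers chosen for illustration.

```python
# With z_i small, the probability of being sampled at least once in n
# independent draws, 1 - (1 - z_i)^n, is close to n * z_i.
n = 100
z = 0.001
pi_exact = 1 - (1 - z) ** n              # about 0.0952
pi_approx = n * z                        # 0.1
z_back = 1 - (1 - pi_exact) ** (1 / n)   # inverts back to z
```

The relative error of the linear approximation here is about 5%; it shrinks as \( n \cdot z_i \) shrinks.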
Even though sampling is independent in each draw, \( Q_i \) and \( Q_j \) are negatively correlated: if unit \( i \) is sampled, there are fewer opportunities left to draw unit \( j. \) It turns out \( \mathrm{Cov}(Q_i, Q_j) = -n \cdot z_i \cdot z_j. \) To see this, let \( R_{ik} \) indicate whether unit \( i \) was sampled in the \( k \)th draw. Then \( Q_i = \sum_{k=1}^n R_{ik} \) and
\begin{align*} \mathrm{Cov}(Q_i, Q_j) &= \mathrm{Cov}\left( \sum_{k=1}^n R_{ik}, \sum_{\ell=1}^n R_{j\ell} \right) \\ &= \sum_{k=1}^n \sum_{\ell=1}^n \mathrm{Cov}\left( R_{ik}, R_{j\ell} \right). \end{align*}
When \( k \neq \ell, \) \( \mathrm{Cov}(R_{ik}, R_{j\ell}) = 0, \) since draws are independent. When \( i \neq j \) and \( k = \ell, \) \( \mathrm{Cov}(R_{ik}, R_{j\ell}) = \mathbf{E}[ R_{ik} R_{jk}] - \mathbf{E}[R_{ik}] \cdot \mathbf{E}[R_{jk}]. \) Since we can’t sample different units in the same draw, \( R_{ik} R_{jk} = 0. \) And \( \mathbf{E}[R_{ik}] = z_i, \) so \( \mathrm{Cov}(R_{ik}, R_{jk}) = - z_i \cdot z_j, \) and \( \mathrm{Cov}(Q_i, Q_j) = \sum_{k=1}^n \mathrm{Cov}(R_{ik}, R_{jk}) = -n \cdot z_i \cdot z_j. \)
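The identity \( \mathrm{Cov}(Q_i, Q_j) = -n \cdot z_i \cdot z_j \) can also be checked by simulation. This sketch uses NumPy and our own choice of a small population:

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.array([0.1, 0.2, 0.3, 0.4])   # single-draw probabilities, sum to 1
n, reps = 5, 200_000

# Each row is one with-replacement sample of n draws from the population.
draws = rng.choice(len(z), size=(reps, n), p=z)
Q0 = (draws == 0).sum(axis=1)        # times unit 0 appears in each sample
Q1 = (draws == 1).sum(axis=1)        # times unit 1 appears in each sample

emp_cov = np.cov(Q0, Q1)[0, 1]       # empirical covariance
theory = -n * z[0] * z[1]            # -5 * 0.1 * 0.2 = -0.1
```

With 200,000 replications, the empirical covariance lands well within Monte Carlo error of the theoretical value.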
Turning now to the estimator, \( \hat{Y}_\textrm{HT} = \frac{1}{n} \sum_{i=1}^N Q_i \cdot w_i \cdot y_i, \) so \( \mathbf{E}\left[ \hat{Y}_\textrm{HT} \right] = \frac{1}{n} \sum_{i=1}^N n \cdot z_i \cdot w_i \cdot y_i \approx \bar{y} \) when \( w_i = \frac{n/N}{\pi_i}. \) This estimator is only approximately unbiased, but the approximation is good for large samples.
\begin{align*} \mathrm{Var}\left( \hat{Y}_\textrm{HT} \right) &= \frac{1}{n^2} \left[ \sum_{i=1}^N \mathrm{Var}(Q_i) \cdot w_i^2 \cdot y_i^2 + \sum_{i=1}^N \sum_{j \neq i} \mathrm{Cov}(Q_i, Q_j) \cdot w_i \cdot w_j \cdot y_i \cdot y_j \right] \\ &= \frac{1}{n^2} \left[ \sum_{i=1}^N n \cdot z_i \cdot (1 - z_i) \cdot w_i^2 \cdot y_i^2 - n \sum_{i=1}^N \sum_{j \neq i} z_i \cdot z_j \cdot w_i \cdot w_j \cdot y_i \cdot y_j \right] \\ &= \frac{1}{n} \left[ \sum_{i=1}^N z_i \cdot w_i^2 \cdot y_i^2 - \sum_{i=1}^N \sum_{j=1}^N z_i \cdot z_j \cdot w_i \cdot w_j \cdot y_i \cdot y_j \right] \\ &= \frac{1}{n} \left[ \sum_{i=1}^N z_i \cdot w_i^2 \cdot y_i^2 - \left( \sum_{i=1}^N z_i \cdot w_i \cdot y_i \right)^2 \right]. \end{align*}
When \( w_i = \frac{n/N}{\pi_i} \approx 1/(N \cdot z_i), \)
\begin{align*} \mathrm{Var} \left( \hat{Y}_\textrm{HT} \right) &\approx \frac{1}{N \cdot n} \left[ \sum_{i=1}^N \frac{y_i^2}{N\cdot z_i} - \frac{1}{N} \left( \sum_{i=1}^N y_i \right)^2 \right] \\ &= \frac{1}{N \cdot n} \left[ \sum_{i=1}^N \left( \frac{y_i^2}{N\cdot z_i} - \bar{y} \cdot y_i \right) \right] \\ &= \frac{1}{N \cdot n} \left[ \sum_{i=1}^N \left( \frac{y_i^2}{N\cdot z_i} - 2 \cdot \bar{y} \cdot y_i + N \cdot z_i \cdot \bar{y}^2 + \bar{y} \cdot y_i - N \cdot z_i \cdot \bar{y}^2 \right) \right] \\ &= \frac{1}{N \cdot n} \left[ \sum_{i=1}^N N \cdot z_i \cdot \left( \frac{y_i}{N \cdot z_i} - \bar{y} \right)^2 + N \cdot \bar{y}^2 - N \cdot \bar{y}^2 \cdot \sum_{i=1}^N z_i \right] \\ &= \frac{1}{N \cdot n} \sum_{i=1}^N N \cdot z_i \cdot \left( \frac{y_i}{N \cdot z_i} - \bar{y} \right)^2 \\ &\approx \frac{1}{N \cdot n} \sum_{i=1}^N \frac{1}{w_i} \cdot \left( w_i \cdot y_i - \bar{y} \right)^2. \end{align*}
The approximations in this derivation can be made exact if we instead use \( w_i = 1 / (N \cdot z_i). \) Since the weights used in practice are estimates anyway, we can incorporate any difference between \( \pi_i \) and \( n \cdot z_i \) into the sensitivity analysis.
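With \( w_i = 1/(N \cdot z_i), \) the chain of equalities above becomes exact, which is easy to confirm numerically on a toy population (the values below are arbitrary):

```python
# Verify that (1/n) * [ sum z_i w_i^2 y_i^2 - (sum z_i w_i y_i)^2 ]
# equals (1/(N*n)) * sum N z_i (y_i / (N z_i) - ybar)^2
# when w_i = 1 / (N * z_i) and the z_i sum to 1.
z = [0.1, 0.2, 0.3, 0.4]
y = [2.0, 5.0, 3.0, 7.0]
N, n = len(z), 10
w = [1.0 / (N * zi) for zi in z]
ybar = sum(y) / N                    # note: sum z_i w_i y_i = ybar exactly

lhs = (sum(zi * wi**2 * yi**2 for zi, wi, yi in zip(z, w, y))
       - sum(zi * wi * yi for zi, wi, yi in zip(z, w, y)) ** 2) / n
rhs = sum(N * zi * (yi / (N * zi) - ybar) ** 2 for zi, yi in zip(z, y)) / (N * n)
```

The two expressions agree to floating-point precision, as the derivation requires.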
We cannot directly calculate this quantity because it depends on \( \bar{y}, \) which we do not know. (If we did, inference would be unnecessary.) Neither do we know \( y_i \) for the unsampled individuals. Instead we show that \[ \widehat{\mathrm{Var}}\left(\hat{Y}_\textrm{HT}\right) = \frac{1}{n \cdot (n-1)} \cdot \sum_{i \in \mathcal{R}} \left( w_i \cdot y_i - \hat{Y}_\textrm{HT} \right)^2 \] is an unbiased estimate of \( \mathrm{Var}\left( \hat{Y}_\textrm{HT} \right). \)
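A simulation gives a quick sanity check that this estimator tracks \( \mathrm{Var}\left(\hat{Y}_\textrm{HT}\right) \): generate a synthetic population, draw many with-replacement samples, and compare the Monte Carlo variance of \( \hat{Y}_\textrm{HT} \) to the average of the variance estimates. The population and constants below are our own invention.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, reps = 200, 25, 20_000

# Synthetic population: draw probabilities mildly correlated with the outcome.
z = rng.uniform(0.5, 1.5, size=N)
z /= z.sum()                         # single-draw probabilities sum to 1
y = 10.0 + 50.0 * z + rng.normal(0.0, 0.2, size=N)
w = 1.0 / (N * z)                    # exact weights w_i = 1 / (N z_i)

estimates, var_hats = [], []
for _ in range(reps):
    idx = rng.choice(N, size=n, p=z)         # with-replacement sample
    wy = w[idx] * y[idx]
    y_ht = wy.mean()                         # Horvitz-Thompson estimate
    var_hats.append(((wy - y_ht) ** 2).sum() / (n * (n - 1)))
    estimates.append(y_ht)

emp_var = np.var(estimates)          # Monte Carlo variance of the estimator
mean_var_hat = np.mean(var_hats)     # average of the variance estimates
```

Under the with-replacement model, the terms \( w_i \cdot y_i \) are i.i.d. across draws, so the sample variance of their mean is exactly unbiased; the two quantities agree up to Monte Carlo error.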
- Proof that \( \widehat{\mathrm{Var}}(\hat{Y}_\textrm{HT}) \) is unbiased for \( \mathrm{Var}(\hat{Y}_\textrm{HT}). \)