Tests with One-Sided Noncompliance

Introduction

Tech companies spoil data scientists. It’s so easy for us to A/B test everything. We can alter many aspects of the product from a configuration UI. We have the sample size to get a good read in as little as a few days. We have the data infrastructure to analyze and report results quickly.

With these advantages, we sometimes neglect ingenuity when confronting adversity. We discard an imperfect test too quickly, preferring to start over when better options exist.

Consider a common test design. A feature gate releases a new feature to a random subset of our user base. Comparing behaviors between the treatment group exposed to the new feature and the unexposed holdout tells us about the impact of the feature.

We must consider network effects with this design. Suppose Alice gets the new feature and tells her friend Brian about it. If Brian is in the holdout group, he might feel left out. His negative reaction could affect his subsequent behavior. In this scenario, Alice’s test assignment influences Brian’s behavior. This is a violation of the Stable Unit Treatment Value Assumption (SUTVA). I’ve written about this before and won’t discuss it any further here.

Sometimes a computer bug leaks access to people in the holdout. This happened in my first A/B test. The client device would request an experiment group assignment from the server. If the server didn’t respond in time, the device would self-assign to the treatment group with the new feature enabled. (The client always had the feature enabled, and only disabled it if the server said to.) As a result, clients with poor internet connections sometimes received the feature.

In another test, the server went down for a few hours. Again, the client had the feature available and relied on the server to tell it to disable it. No server, no disabling, and everyone using the service at the time received the feature.

We should always strive to administer an A/B test faithfully, but bugs exist. Upon encountering one, we must decide how to proceed.

These bugs have an asymmetric impact on tests. Only the holdout groups are affected. This is called “one-sided noncompliance”. Other scenarios involve two-sided noncompliance, but the one-sided case is common and warrants special discussion.

As Treated, Per Protocol, and Intent to Treat

Such an obstacle may tempt us to throw out the test and start over, often with just a few button clicks in a dashboard. Before doing so, consider that some people have already experienced the feature. If we start a new test, some people who had access to the new feature will lose it, and their behavior may differ from that of someone who never had access to begin with. Better to salvage what we can from the test as conducted.

Some data scientists jump to approaches called “as treated” or “per protocol” analyses (Imbens and Rubin 2015, §23.9). “As treated” analyses combine the people in the holdout who received the feature with the people in the treatment group, prioritizing exposure over the treatment assignment. “Per protocol” discards the people in the holdout group who received the new feature instead of lumping them in with the treatment group.

Neither approach yields valid inferences. In our first example, people in the holdout group with poor internet connections received the feature, against protocol. In the “as treated” analysis, grouping by exposure means there are too many people with poor internet connections in the exposed group and too few in the unexposed group.

Grouping by random assignment ensures balance (on average) across all covariates, including internet connectivity. Grouping by exposure introduces imbalance. If people with poor internet connectivity tend to have worse outcomes (such as retention or purchases), then the estimated impact under-reports the true impact. “Per protocol” does not change the treatment group, but still produces imbalance by removing people with poor internet connections from the holdout group, biasing the estimated impact.

We could just pretend the bug didn’t occur: we would analyze people according to their group assignment, ignoring actual exposure to the feature. Since random assignment balances covariates across both groups, this is an apples-to-apples comparison.

This “Intent-to-Treat” (ITT) design measures the impact of the group assignment, not the impact of the feature (Imbens and Rubin 2015, §23.4). We don’t want to measure the impact of group assignment; we want to measure the impact of exposure to the feature. ITT measures the wrong thing, but what it estimates, it estimates without bias. With minor contamination, this distinction may not matter. The simulation sketch below makes the contrast concrete.
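
Here is a minimal simulation sketch in Python. All of the numbers and the outcome model are invented for illustration: 20% of users have poor connectivity, which both leaks the feature to them in the holdout and depresses their retention.

```python
# Minimal simulation sketch: invented numbers, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

assigned = rng.integers(0, 2, n)   # 1 = treatment, 0 = holdout
poor_conn = rng.random(n) < 0.20   # 20% have poor connectivity

# One-sided noncompliance: the treatment group is always exposed;
# in the holdout, only poor-connectivity users get exposed (the bug).
exposed = (assigned == 1) | poor_conn

# Hypothetical outcome model: exposure lifts retention by 5 points;
# poor connectivity depresses it by 10 points.
p_retain = 0.50 + 0.05 * exposed - 0.10 * poor_conn
y = rng.random(n) < p_retain

itt = y[assigned == 1].mean() - y[assigned == 0].mean()
as_treated = y[exposed].mean() - y[~exposed].mean()
per_protocol = y[assigned == 1].mean() - y[(assigned == 0) & ~exposed].mean()

print(f"ITT:          {itt:+.3f}")          # ~ +0.04: effect of assignment
print(f"As treated:   {as_treated:+.3f}")   # ~ +0.02: confounded by connectivity
print(f"Per protocol: {per_protocol:+.3f}") # ~ +0.03: confounded by connectivity
```

The true effect of exposure in this simulation is +0.05. Grouping by assignment (ITT) yields an attenuated but unconfounded estimate, while grouping by exposure mixes in the connectivity imbalance.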

Potential Outcomes Notation

We can adjust the Intent-to-Treat estimate, aligning it with the impact of the feature itself. To motivate this approach, consider why Intent-to-Treat fails. ITT combines the impact among two cohorts: compliers (people who adhere to their test assignment and who would only be exposed to the feature if assigned to the treatment group), and always-takers (people who would be exposed to the feature regardless of group assignment).

The potential outcomes notation clarifies this distinction. Consider a particular individual in the holdout group, $i.$ Let $Z_i(0) = 1$ if they were exposed to the feature (because of the bug), and 0 otherwise.

We expand this definition to apply to any individual, not just in the holdout group, by letting $Z_i(0)$ represent a potential exposure. If person $i$ is in the treatment group, $Z_i(0)$ indicates whether they would have been exposed to the feature, had they (counterfactually) been assigned to the holdout group. We don’t observe $Z_i(0)$ since the person wasn’t assigned to the holdout group; we can only speculate.

These counterfactual outcomes are like a tree falling in the forest with no one around to hear it. By definition, we can’t know whether the tree makes a sound, but it either does or it doesn’t, and we certainly can speculate about it. Counterfactual outcomes are unobservable, but they are well-defined (for well-defined treatments), and play an important role in causal inference.

Let $Z_i(1)$ indicate whether individual $i$ would have been exposed to the feature if assigned to the treatment group. Throughout, we have assumed that everyone in the treatment group is exposed to the feature. In some instances, that may not be true; the noncompliance might be two-sided. Because our noncompliance is one-sided, we know that $Z_i(1) = 1$ for everyone, even for people in the holdout group. We do not need to speculate about this counterfactual outcome: we know it with certainty.

We can now categorize everyone as either a complier or an always-taker according to their potential exposures, as tabulated below.

| $Z_i(0)$ | $Z_i(1)$ | Cohort       |
|----------|----------|--------------|
| 0        | 1        | Complier     |
| 1        | 1        | Always-Taker |

Only $Z_i(0)$ distinguishes compliers from always-takers. We do not need to observe $Z_i(1)$ because we already know it equals 1 for everyone. We observe $Z_i(0)$ for people in the holdout group, and so we observe their cohort. If person $i$ in the holdout group was affected by the bug, they are an always-taker. Otherwise, they are a complier. For people in the treatment group, we can only speculate about their cohort.

Thanks to random assignment, the proportion of men and women is roughly equal in both treatment and holdout groups. The distribution of internet connectivity is roughly equal, too. Random assignment balances all covariates, including $Z(0).$ Although we don’t observe complier or always-taker status in the treatment group, random assignment ensures the proportion of compliers is roughly equal in both treatment and holdout groups. Since we observe the proportion of compliers in the holdout group, we can estimate the proportion of compliers in the treatment group, and in the overall population. We exploit this insight to adjust the Intent-to-Treat estimate.
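
Continuing the simulation sketch above (same hypothetical variable names), the always-taker proportion is directly observable in the holdout:

```python
# Cohort proportions, estimated from the holdout group alone.
holdout = assigned == 0
pi_always_takers = exposed[holdout].mean()  # fraction of holdout exposed
pi_compliers = 1.0 - pi_always_takers

# By random assignment, roughly the same proportions hold in the
# treatment group, where cohort membership is unobservable.
```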

This notation is not limited to the exposure status. Let $Y_i(0)$ be the outcome (retention, purchases, whatever) we would observe for person $i$ if they were assigned to the holdout, and let $Y_i(1)$ be the outcome if assigned to the treatment group. If person $i$ is assigned to the holdout group, we observe $Y_i(0)$ and not $Y_i(1),$ and vice versa if the person is assigned to the treatment group. We call $(Z_i(0), Z_i(1))$ the potential exposures for individual $i$, and $(Y_i(0), Y_i(1))$ the potential outcomes.

The effect of treatment assignment on person $i$ is $Y_i(1) - Y_i(0).$ Since we never observe both potential outcomes for the same individual, individual effects are unobservable. But thanks to random assignment, the difference in average outcomes, treatment minus holdout, is an unbiased estimate of the average effect of assignment across everyone in the test. This is just the Intent-to-Treat estimate.
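
In code, the ITT estimate is just a difference in group means; a large-sample standard error follows from the usual two-sample formula (a sketch, continuing the variables above):

```python
# ITT: difference in average outcomes by assignment, ignoring exposure.
itt_estimate = y[assigned == 1].mean() - y[assigned == 0].mean()

# Large-sample standard error for a difference in means.
se_itt = np.sqrt(
    y[assigned == 1].var(ddof=1) / (assigned == 1).sum()
    + y[assigned == 0].var(ddof=1) / (assigned == 0).sum()
)
```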

Instrumental Variables

The Intent-to-Treat approach combines the impact (of random assignment) among two cohorts, compliers and always-takers, across two groups, treatment and holdout. Consider these four combinations in turn.

The compliers in either group are unaffected by the bug. Their potential exposures, $Z_i(0) = 0$ and $Z_i(1) = 1,$ and potential outcomes, $Y_i(0)$ and $Y_i(1),$ are uncontaminated. The always-takers in the treatment group are contaminated only hypothetically: their observed exposure, $Z_i(1) = 1,$ and outcome, $Y_i(1),$ are the same as if the bug had not occurred, and only observed quantities affect our conclusions.

Only the always-takers in the holdout group are problematic. Their exposure is different than it should be: $Z_i(0) = 1,$ and so their outcome, $Y_i(0),$ is potentially different. If, for example, the effect of exposure on individual $i$ is positive, then $Y_i(0)$ is inflated by the bug. As a result, the effect of random assignment on individual $i,$ $Y_i(1) - Y_i(0),$ is deflated by the bug, and the average across everyone in the test—what Intent-to-Treat estimates—is distorted.

Now we apply some business logic. Regardless of the direction of impact (positive or negative), accidental exposure brings $Y_i(0)$ and $Y_i(1)$ closer together. For negligible exposure, there is negligible distortion of the treatment effect. For lasting exposure, say for the entire test duration, the distortion is total, $Y_i(0) = Y_i(1).$ Intermittent exposure leads to intermediate distortion.

Depending on the duration of exposure, and the nature of the feature, we might consider one extreme or the other more plausible. For example, very brief exposure might not have any impact on an always-taker’s outcome. The outcome actually observed might match the outcome we would have observed if not for the bug. In that case, we trust the Intent-to-Treat estimate.

At the other extreme, prolonged exposure might lead us to expect an outcome nearly the same as it would have been had they been assigned to the treatment group. In the slow-internet example, people with poor internet connections were always exposed to the feature. In this case, the test assignment for this group is inconsequential, and the impact of assignment is zero. The Instrumental Variables approach operates under this assumption.

The Intent-to-Treat estimate combines the normal impact (for the compliers) and distorted impact (for the always-takers): $$ \textrm{ITT} = \textrm{Impact}_\textrm{compliers} \cdot \pi_\textrm{compliers} + \textrm{Impact}_\textrm{always-takers} \cdot \pi_\textrm{always-takers}, $$ where $\pi_\textrm{compliers}$ and $\pi_\textrm{always-takers}$ are the proportion of compliers and always-takers in the test. Since everyone is either a complier or an always-taker, we have $\pi_\textrm{compliers} + \pi_\textrm{always-takers} = 1.$ In a scenario where $\textrm{Impact}_\textrm{always-takers} = 0,$ this equation simplifies to: $$ \begin{align} \textrm{ITT} &= \textrm{Impact}_\textrm{compliers} \cdot \pi_\textrm{compliers}, \textrm{ or} \\ \textrm{Impact}_\textrm{compliers} &= \frac{\textrm{ITT}}{\pi_\textrm{compliers}} \\ &= \frac{\textrm{ITT}}{1 - \pi_\textrm{always-takers}}. \quad (1) \end{align} $$

This formula is called the Instrumental Variables estimator for one-sided noncompliance. It scales the ITT estimate up (dividing by a number less than one) to correct for the noncompliance.

Thanks to random assignment, we can estimate the proportion of always-takers from the holdout group: it’s the fraction of people in the holdout who were exposed to the feature. Notice the last line of Equation (1). If the bug affects few people, then we divide the ITT estimate by a number close to one, barely adjusting the impact.
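
Equation (1) is one line of code given the quantities computed above (a sketch; Imbens and Rubin 2015, §23, cover standard errors for this ratio):

```python
# Instrumental Variables estimate under one-sided noncompliance,
# assuming zero impact of assignment on the always-takers.
iv_estimate = itt_estimate / (1.0 - pi_always_takers)
```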

In this approach, we estimate the impact of the feature only on the compliers. For them, the impact of the feature (the desired insight) equals the impact of random assignment (what we can measure). For the always-takers, the impact of assignment has no connection to the impact of the feature. Indeed, assuming zero impact of assignment allowed us to infer the feature impact on the compliers.

We might hope to infer the effect of the feature for the always-takers based on the effect for the compliers. In my work, I expect heterogeneous treatment effects varying from individual to individual. In this setting, we have no reason to expect we can learn anything about the effect of a feature on some people by estimating the effect on others, however well we estimate it.

With minor contamination, the cohort of always-takers will be small. A Manski-style sensitivity analysis may relieve any anxiety (Manski 2003). The average impact of the feature is: $$ \textrm{Impact}_\textrm{compliers} \cdot \pi_\textrm{compliers} + \textrm{Impact}_\textrm{always-takers} \cdot \pi_\textrm{always-takers}, $$ where we have estimates for every quantity except $\textrm{Impact}_\textrm{always-takers}$. If the outcome is binary (a person either retains or not; they either purchase or not), the impact is bounded between -1 (if the feature dissuades everyone from purchasing) and +1 (if the feature leads to everyone purchasing).

Then the endpoints of a Manski-style interval on the impact are: $$ \begin{align} & \textrm{Impact}_\textrm{compliers} \cdot \pi_\textrm{compliers} \pm \pi_\textrm{always-takers}, \textrm{ or} \\ & \textrm{ITT} \pm \pi_\textrm{always-takers}. \end{align} $$ When $\pi_\textrm{always-takers}$ is small, this interval will be narrow. This derivation makes no assumption about the impact on the always-takers, but does rely on the impact being bounded, as would be the case with binary outcomes.
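
For a binary outcome, the Manski-style interval is equally direct (a sketch, continuing the variables above):

```python
# Manski-style bounds: no assumption about the always-takers' impact,
# only that a binary outcome bounds it between -1 and +1.
manski_lower = itt_estimate - pi_always_takers
manski_upper = itt_estimate + pi_always_takers
```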

Dose-Response Models

Instrumental Variables (as we have presented it here for one-sided noncompliance) assumes the impact of assignment on the always-takers was zero. With brief but consequential exposure, the outcome may be distorted, but not totally. In this case, a more detailed model regarding the relationship between the exposure and the outcome may help. The IV and ITT estimators still bound the impact, though, and with minor contamination that will often be good enough.

Suppose the impact of the feature depends on how much people use it. People who rarely use it experience little impact, and people who use it often experience larger impact. We might start off with a simple linear model relating exposure and outcome: $$ Y_i(1) - Y_i(0) = \beta \cdot (Z_i(1) - Z_i(0)). \quad (2) $$ This formula modifies the potential exposures to represent the amount of feature usage, perhaps in terms of time spent or number of sessions. For a complier, $Z_i(0)$ is zero as before, but for someone exposed to the bug, $Z_i(0) > 0$. Even for such a person, intermittent exposure would suggest $Z_i(0) < Z_i(1)$, quantifying partial contamination.

For someone affected by the bug (continuing the positive-impact example), $Y_i(0)$ is higher than it should be, and $Y_i(1) - Y_i(0)$ is lower than it should be. This model states that the difference in potential outcomes is proportional to the difference in potential exposures. When we discussed the Intent-to-Treat estimate, we saw it under-reported the treatment effect. The dose-response model offers an explanation: the difference in potential outcomes is too low because the difference in potential exposures is too low.

In this model, $\beta$ is the effect of increased exposure on the outcome, in whatever units we measure exposure. We can solve Equation (2) for $\beta$: $$ \beta = \frac{Y_i(1) - Y_i(0)}{Z_i(1) - Z_i(0)}, $$ and substitute sample averages: $$ \hat{\beta} = \frac{\bar{Y}_1 - \bar{Y}_0}{\bar{Z}_1 - \bar{Z}_0}, \quad (3) $$ where, for example, $\bar{Y}_1$ is the average outcome in the treatment group.

The numerator in (3) is the ITT estimator. The denominator is the effect of being assigned to the treatment group on exposure. If we condense exposure to a binary indicator, then $\bar{Z}_1 = 1,$ since everyone in the treatment group was exposed, and $\bar{Z}_0 = \pi_\textrm{always-takers},$ giving exactly the same formula as the Instrumental Variables estimate. But Equation (3) is more general, and can be used to support more nuanced concepts of exposure. It can also be used with two-sided noncompliance (when people in the treatment group don’t actually use the feature), but we don’t discuss that here (Imbens and Rubin 2015, §23–§24).
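
Equation (3) is equally simple to compute. The sketch below invents a hypothetical per-person exposure measure `z` (say, sessions with the feature) on top of the earlier simulation; with binary exposure it reduces to the IV estimate above:

```python
# Hypothetical exposure measure, invented numbers: full usage in the
# treatment group, partial usage for always-takers in the holdout,
# zero for compliers in the holdout.
z = np.where(assigned == 1, 10.0, np.where(exposed, 4.0, 0.0))

# Dose-response (Wald) estimator: ratio of the assignment effect on
# the outcome to the assignment effect on exposure.
beta_hat = (
    (y[assigned == 1].mean() - y[assigned == 0].mean())
    / (z[assigned == 1].mean() - z[assigned == 0].mean())
)
```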

Conclusions and Further Reading

Data scientists in tech companies can often start an A/B test with a few button presses. Because it’s so easy, when something goes wrong, our first instinct is often to start over. But it’s often better to analyze an imperfect test. Intent-to-Treat and Instrumental Variables analyses give good insights for tests with minor contamination. Two other approaches, “as treated” and “per protocol,” should be avoided.

We limited our discussion to one-sided noncompliance. For more details on one-sided and two-sided noncompliance, see Imbens and Rubin (2015). They provide guidance on hypothesis tests and confidence intervals for the simple IV estimator (1) and the more general formula (3).

Formulating partial contamination as a dose-response model was inspired by the works of Paul Rosenbaum (e.g. Rosenbaum 2020, §18.4). He provides alternative approaches to hypothesis tests, point estimates, and confidence intervals using randomization inference. In practice, I have found the population-based inference discussed in Imbens and Rubin to give similar answers to the randomization inference championed by Rosenbaum.

Randomization inference provides valid inference even under severe noncompliance, when the bug affects almost everyone for almost the entirety of the test. Common population-based inferences break down in this scenario of “weak instruments”, often with no obvious indication. But because we only learn about the effect on the compliers, with severe non-compliance and heterogeneous treatment effects, this insight becomes less and less valuable.

References

  • Guido W. Imbens and Donald B. Rubin (2015) “Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction”. Cambridge University Press.
  • Charles F. Manski (2003) “Partial Identification of Probability Distributions”. Springer Series in Statistics.
  • Paul R. Rosenbaum (2020) “Design of Observational Studies”. Springer Series in Statistics.
