Instrumental Variables

Data scientists at tech companies are spoiled. It’s so easy for us to A/B test everything. We have the hooks in place to easily alter some aspect of the product with minimal engineering lift, we have the sample size to get a good read in as little as a few days, and we have the data infrastructure to quickly analyze and report results.

Perhaps because it’s so easy for us, at the first sign of trouble we’re tempted to just cancel the test and start over. In this post, I want to talk about a situation that has occurred a few times in my career and the right way to handle it.


A common test design is a feature gate: some new feature is developed and we want to measure the impact by releasing it to some people and not others. One consideration with this design is network effects: if Alice gets the new feature and tells her friend Brian about it, but Brian is in the holdout group, Brian’s subsequent behavior might be influenced by Alice’s test assignment. This is a violation of the Stable Unit Treatment Value Assumption (SUTVA). I’ve written about this before and won’t discuss it any further here.

But sometimes gates aren’t closed properly, and some people who weren’t supposed to receive the feature nevertheless gain access to it. The very first A/B test I was ever involved with had this happen. There was a weird edge case where the client device would ask the server “which group am I in?” but if the server didn’t respond in time, the device would self-assign to the treatment group with the new feature enabled. (The client always had the feature enabled, and only disabled it if the server said to.) So clients with poor internet connections didn’t get the feature disabled properly. This is a poor way to administer an A/B test, but that’s what happened.

There was another test where the server just went down for a few hours. Again, the client had the feature available and relied on the server to tell it to disable it. No server, no disabling, and everyone who happened to be using the service at that time received the feature.

Clearly, this isn’t ideal. We should always strive to administer an A/B test as cleanly as possible. But software inevitably has bugs, and when we run into one, we have to figure out how to move forward.

Some Options

Of course, the first option is to throw out the test and start over. This is never a great option, but sometimes it’s a terrible option. At this point, some people have already been exposed to the feature. If you “reshuffle the deck” and start a new test, some people who previously had access to the new feature no longer will. That may influence their subsequent behavior in a way that differs from someone who has never had access. I’ll let you think through other edge cases here.

Most data scientists encountering this issue for the first time will jump to alternatives called “as treated” or “per protocol” analyses. (Spoiler alert: these don’t work, but they’re still the first thing that everyone thinks of.) An “as treated” analysis compares the outcomes of people who were exposed to the feature with the outcomes of people who were not, ignoring the fact that, because of the bug, exposure is not completely random. A “per protocol” analysis discards the people in the holdout group who were accidentally exposed.

Let’s look at the first example to see why these don’t work. The bug is that people in the holdout group with poor internet connections were accidentally exposed to the feature. So in the “as treated” analysis, people with poor internet connectivity are over-represented in the exposed group, and under-represented in the unexposed group. If people with poor internet connectivity tend to have worse outcomes (such as retention or purchases), then the measured impact will under-report the true impact. “Per protocol” isn’t really any better: people with poor internet connectivity get discarded, but only from the holdout group. So they are still under-represented in the holdout, and the average outcome in the holdout is artificially inflated.
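To put rough numbers on this, here is a minimal sketch of the slow-internet scenario using expected values rather than real data. Every number in it (the share of users with poor connectivity, the baseline retention rates, and a uniform lift of 0.05) is a made-up assumption for illustration, as are the names `weighted_mean` and `naive_estimates`.

```python
# Hypothetical population for the slow-internet bug. All numbers are invented.
POOR_SHARE = 0.10    # share of users with poor connectivity (exposed no matter what)
BASE_GOOD = 0.50     # baseline retention with good connectivity
BASE_POOR = 0.30     # baseline retention with poor connectivity
TRUE_EFFECT = 0.05   # lift from exposure, assumed identical for everyone


def weighted_mean(cells):
    """Mean outcome over a list of (weight, outcome) cells."""
    total_weight = sum(weight for weight, _ in cells)
    return sum(weight * outcome for weight, outcome in cells) / total_weight


def naive_estimates():
    """Expected 'as treated' and 'per protocol' estimates under 50/50 assignment."""
    good, poor = 1 - POOR_SHARE, POOR_SHARE
    treat_good = (0.5 * good, BASE_GOOD + TRUE_EFFECT)  # assigned and exposed
    treat_poor = (0.5 * poor, BASE_POOR + TRUE_EFFECT)  # assigned and exposed
    hold_good = (0.5 * good, BASE_GOOD)                 # holdout, unexposed
    hold_poor = (0.5 * poor, BASE_POOR + TRUE_EFFECT)   # holdout, exposed by the bug

    # As treated: everyone exposed versus everyone unexposed.
    as_treated = (
        weighted_mean([treat_good, treat_poor, hold_poor])
        - weighted_mean([hold_good])
    )
    # Per protocol: drop the leaked holdout users, keep the full treatment group.
    per_protocol = weighted_mean([treat_good, treat_poor]) - weighted_mean([hold_good])
    return as_treated, per_protocol


as_treated, per_protocol = naive_estimates()
print(f"true effect:  {TRUE_EFFECT:.4f}")
print(f"as treated:   {as_treated:.4f}")
print(f"per protocol: {per_protocol:.4f}")
```

With these assumed inputs, the as-treated estimate comes out around 0.014 and the per-protocol estimate around 0.030, both well short of the assumed true lift of 0.05. The size of the gap depends entirely on the invented numbers; only the direction of the bias is the point.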

The next option people consider is called the “intent to treat” (ITT) analysis. We just pretend the bug didn’t occur: people who were assigned to the holdout group will be analyzed as such, even if they were exposed to the feature. This does result in an apples-to-apples comparison, but it’s measuring the impact of being assigned to a particular group, not the impact of being exposed to the feature. If the contamination is minor, this is “close enough”.
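As a small illustration, an ITT estimate only needs the assignment and the outcome. The sketch below uses a tiny invented dataset, and `itt_estimate` is a hypothetical helper name, not a library function. The `exposed` list is included only to show that it is deliberately ignored.

```python
from statistics import mean


def itt_estimate(assigned, outcomes):
    """Difference in mean outcome by assigned group, ignoring actual exposure."""
    treatment = [y for z, y in zip(assigned, outcomes) if z == 1]
    holdout = [y for z, y in zip(assigned, outcomes) if z == 0]
    return mean(treatment) - mean(holdout)


# Toy data, invented for illustration. 1 = treatment/exposed/retained.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
exposed = [1, 1, 1, 1, 0, 0, 0, 1]  # the last holdout user leaked; unused by ITT
outcome = [1, 1, 1, 0, 1, 0, 0, 1]

itt = itt_estimate(assigned, outcome)
print(f"ITT estimate: {itt:.2f}")
```

The leaked user stays in the holdout group, so the comparison remains between two randomly assigned groups.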

Instrumental Variables

But the canonical thing to do in this situation is Instrumental Variables (IV). I say “canonical” and not “right”, because in the examples I gave it’s actually still not perfect. I’ll get to that.

To motivate IV, consider where ITT falls short. ITT measures the impact of assignment, not the impact of exposure. It averages that impact across two groups: compliers, who follow their test assignment (if assigned to the holdout, they never accidentally gain access), and always-takers, who are exposed regardless of assignment (people with poor internet connections affected by the bug). In the holdout group, we know who the compliers and always-takers are. People in the holdout who accidentally gained access are always-takers, and everyone else is a complier. But in the treatment group, everyone has access. We don’t know for sure who would have been affected by the bug. Still, some fraction of the test group are compliers and the rest are always-takers.

The ITT impact is the average of the effect of assignment on the compliers and the always-takers. The impact of assignment on the always-takers is zero, since assignment does not actually influence exposure. (Actually, that’s not always right! In the second example I gave, the server was only down for a few hours. So the affected people in the holdout group were only exposed for a few hours, not for the whole test. So their exposure is still different from the exposure in the test group, and maybe brief exposure is closer to no exposure than to total exposure. In this situation, I actually think ITT is the better option. But in the first example, people with poor internet connections really were treated the same as people in the test group.)

The impact of assignment on the compliers is…the normal impact. So the overall impact is an average of the normal impact and zero: that’s too low. What IV does is adjust the ITT impact up a little bit to compensate:
$$ \textrm{ITT} = \textrm{Impact}_\textrm{compliers} \cdot \pi_\textrm{compliers} + \textrm{Impact}_\textrm{always-takers} \cdot \pi_\textrm{always-takers}, $$
where $\pi_\textrm{compliers}$ and $\pi_\textrm{always-takers}$ are the proportion of compliers and always-takers in the test. Since everyone is either a complier or an always-taker, we have $\pi_\textrm{compliers} + \pi_\textrm{always-takers} = 1.$ Since we argued that $\textrm{Impact}_\textrm{always-takers} = 0$, this equation simplifies to
$$ \begin{aligned} \textrm{ITT} &= \textrm{Impact}_\textrm{compliers} \cdot \pi_\textrm{compliers}, \textrm{ or} \\ \textrm{Impact}_\textrm{compliers} &= \frac{\textrm{ITT}}{\pi_\textrm{compliers}} \\ &= \frac{\textrm{ITT}}{1 - \pi_\textrm{always-takers}} \end{aligned} $$
That is, the impact of exposure on the compliers (the people not affected by the bug), is the ITT effect divided by the proportion of compliers. This formula is called the instrumental variables estimate. The last expression is perhaps the most illuminating: if the number of people affected by the bug is small, then we are dividing by a number that is close to one, and our change to the ITT estimate is minor.
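Here is a minimal sketch of that calculation on record-level data. It assumes one-sided noncompliance as in the examples above (everyone assigned to treatment is exposed), estimates the always-taker share as the fraction of the holdout group that was exposed, and uses a tiny invented dataset. The name `iv_estimate` is hypothetical.

```python
from statistics import mean


def iv_estimate(assigned, exposed, outcomes):
    """ITT effect divided by the estimated share of compliers.

    Assumes everyone assigned to treatment was exposed, so the only
    noncompliance is holdout users who gained access.
    """
    rows = list(zip(assigned, exposed, outcomes))
    treatment = [(d, y) for z, d, y in rows if z == 1]
    holdout = [(d, y) for z, d, y in rows if z == 0]

    if any(d != 1 for d, _ in treatment):
        raise ValueError("expected every treatment user to be exposed")

    # Share of the holdout that was exposed anyway: the always-takers.
    always_taker_share = mean(d for d, _ in holdout)
    complier_share = 1 - always_taker_share
    if complier_share <= 0:
        raise ValueError("no compliers: the estimate is undefined")

    itt = mean(y for _, y in treatment) - mean(y for _, y in holdout)
    return itt / complier_share


# Toy data, invented for illustration. 1 = treatment/exposed/retained.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
exposed = [1, 1, 1, 1, 0, 0, 0, 1]
outcome = [1, 1, 1, 0, 1, 0, 0, 1]

iv = iv_estimate(assigned, exposed, outcome)
print(f"IV estimate: {iv:.3f}")
```

In this toy dataset the ITT effect is 0.25 and one of the four holdout users leaked, so the estimate is 0.25 / 0.75, or about 0.333. A real analysis would also need a standard error, which this sketch does not attempt.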

What’s cool about this is we can estimate the impact among the compliers even though we don’t know who in the test group is a complier. The key assumption was that the impact of assignment on the always-takers was exactly zero. This is known as the “exclusion restriction”, and it was reasonable in the slow internet example I gave and not in the server-down example. In the server-down example, I would argue the impact among the always-takers is not zero, and not the full impact (because the outcome among the always-takers in the holdout is somewhat inflated by the temporary access, assuming the feature is beneficial), but somewhere between. So the overall impact is somewhere between the ITT effect and the complier impact.
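One rough way to see the “somewhere between” claim is the following sketch, which adds two assumptions not made above: the feature has the same impact $\tau$ on everyone, and brief exposure delivers a fraction $\lambda$ (between 0 and 1) of that impact to the always-takers in the holdout. Both $\tau$ and $\lambda$ are symbols introduced here purely for illustration. Assignment then changes the always-takers’ outcome by only $(1 - \lambda)\,\tau$, and

$$ \textrm{ITT} = \tau \cdot \pi_\textrm{compliers} + (1 - \lambda)\,\tau \cdot \pi_\textrm{always-takers} = \tau\,(1 - \lambda\,\pi_\textrm{always-takers}), \quad \textrm{so} \quad \tau = \frac{\textrm{ITT}}{1 - \lambda\,\pi_\textrm{always-takers}}. $$

Under these assumptions, $\lambda = 0$ (brief exposure does nothing) gives back the ITT effect, $\lambda = 1$ (brief exposure is as good as full exposure) gives back the instrumental variables estimate, and anything in between lands between the two.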

Note that we can only calculate the impact of the feature on the compliers. If the feature has different impact on different people, which is almost certainly the case, then we gain no insights about the impact of the feature on the always-takers: the people affected by the bug. But if the impact of the bug is minor, then there aren’t many always-takers, and we still get good insights.

Conclusions and Further Reading

It’s so easy for data scientists in tech companies to run A/B tests that, when something goes wrong, the temptation is to cancel the test and start over. But with minor contamination, that’s unnecessary. Both Intent-to-Treat and Instrumental Variables analyses can give us good insights about the test. Two other approaches, “as treated” and “per protocol”, should be avoided. The scenarios discussed here are “one-sided noncompliance”, but sometimes the noncompliance is two-sided. For more details on one- and two-sided noncompliance, see “Causal Inference for Statistics, Social, and Biomedical Sciences” by Guido Imbens and Donald Rubin.

Bob Wilson
Data Scientist

The views expressed on this blog are Bob’s alone and do not necessarily reflect the positions of current or previous employers.