The Alternative to Causal Inference is Worse Causal Inference
A few months ago, Andrew Gelman had a post on his blog talking about how “the alternative to good statistics is not no statistics, it’s bad statistics.” (Gelman was in turn quoting the baseball analyst Bill James.)
That idea really resonated with me. Some of the most important questions data scientists investigate are causal questions. They’re also some of the hardest questions to answer! When we can run a structured A/B test, it provides the cleanest, most reliable estimate of the consequences of a particular decision. But that isn’t always possible. And when we can’t or don’t run an A/B test, those questions don’t just go away. People still want to know the impact of a particular decision, and in the absence of a data-driven estimate, they are often all too willing to substitute methods of questionable reliability.
Here’s an example. Back when I was at Tinder, we released a new feature that we hoped would be truly revolutionary for the business. (I’d rather not give specifics on what it was.) At the time, I was just beginning to learn about A/B testing and was definitely not up to speed on any of the more exotic approaches to causal inference. We knew we couldn’t A/B test the release because of network effects. Even if we did release it to some people and not others, the people in the two groups would talk to one another, the holdouts would be mad they didn’t get to use the new feature, the rollouts would be mad we were treating them like guinea pigs, whatever. So we didn’t do any kind of randomized study.
Of course, the executives still wanted to know the impact of the feature on user engagement, on growth, on monetization. Really, anything to validate that this feature was as impactful as everyone who had worked so hard on it wanted it to be. The idea was floated to compare retention and time spent in the app between people who used the feature and people who didn’t, but I knew just enough about causal inference to know that comparison would give a biased picture. There are plenty of reasons why a person might be both more likely to use the new feature and more likely to have better retention and engagement, so we wouldn’t be able to conclude that the feature itself was the reason for any improvement.
I wish I could tell you we did something really sophisticated and saved the day and everyone called us heroes. In fact, all we did was compare engagement before and after the release, plus some deep dives on how people who used the feature differed from people who didn’t. I’m actually really proud of the work we did then. It was the most thorough analysis of any aspect of Tinder we did in my time there; it just didn’t answer the most important questions.
And that experience, more than any other, is what taught me how important Causal Inference is. Nothing I had learned in my MS program about random forests or neural networks or collaborative filtering helped me in that situation. (Andrew Ng’s lecture on Simpson’s Paradox is what saved me from comparing engagement between people who used the feature and people who didn’t.) I knew that I needed to become an expert in Causal Inference in order to help businesses get value out of data.
What would I do differently today, many years later? I would work with the engineers to include a little popup tutorial on the new feature when people open the app after upgrading. I would A/B test the popup: some people would see it and others wouldn’t. I would measure the impact of the popup on usage of the feature, and the impact of the popup on engagement, retention, whatever. It’s not plausible that the popup itself would impact engagement, except via a lift in usage of the feature. Thus, the popup provides a perfect “instrument” for the feature, without having to restrict usage of the feature at all. The method of instrumental variables lets us combine the popup’s impact on feature usage and its impact on engagement (essentially by dividing the latter by the former) to estimate the impact of the feature itself on engagement. Honestly, it’s a nice, clean way to estimate the impact of something you aren’t willing to A/B test directly. (Mostly Harmless Econometrics by Angrist and Pischke is a great introduction to instrumental variables and other causal inference techniques.)
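To make that concrete, here’s a minimal sketch of the idea in Python. This is a toy simulation, not Tinder data: the column names (saw_popup, used_feature, engagement) and every number in it are made up for illustration, and the “IV estimate” is just the simple Wald ratio you get with one binary, randomized instrument.

```python
import numpy as np
import pandas as pd

# Toy simulation: hypothetical column names and effect sizes, for illustration only.
rng = np.random.default_rng(0)
n = 100_000

saw_popup = rng.integers(0, 2, n)        # randomized instrument: tutorial popup shown or not
latent_interest = rng.normal(size=n)     # unobserved confounder: drives both usage and engagement
used_feature = (0.8 * saw_popup + 0.5 * latent_interest + rng.normal(size=n) > 0.5).astype(int)
true_effect = 2.0                        # the effect we hope to recover
engagement = 10 + true_effect * used_feature + 3.0 * latent_interest + rng.normal(size=n)

df = pd.DataFrame({"saw_popup": saw_popup,
                   "used_feature": used_feature,
                   "engagement": engagement})

# Naive comparison of users vs. non-users: biased, because latent interest
# drives both feature usage and engagement.
naive = (df.loc[df.used_feature == 1, "engagement"].mean()
         - df.loc[df.used_feature == 0, "engagement"].mean())

# Wald / IV estimate: (popup's effect on engagement) / (popup's effect on usage).
# Both pieces are unbiased because the popup is randomized.
itt_engagement = (df.loc[df.saw_popup == 1, "engagement"].mean()
                  - df.loc[df.saw_popup == 0, "engagement"].mean())
itt_usage = (df.loc[df.saw_popup == 1, "used_feature"].mean()
             - df.loc[df.saw_popup == 0, "used_feature"].mean())
iv_estimate = itt_engagement / itt_usage

print(f"naive difference: {naive:.2f}")        # inflated by selection
print(f"IV estimate:      {iv_estimate:.2f}")  # roughly the true effect of 2.0
```

The point isn’t the simulation itself. It’s that the ratio only requires the popup assignment to be random, which is exactly the thing we can A/B test even when we can’t hold back the feature.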
This isn’t perfect either! Instrumental variables relies on assumptions like homogeneous treatment effects, the exclusion restriction, and so on. And it only works if the instrument (the popup in this example) really does make a big impact on the treatment (using the feature); that last part, at least, is checkable, as in the sketch below. Most of the techniques of causal inference have untestable assumptions, caveats, limitations. Data scientists may feel a little squeamish using such techniques. But the alternative is not no inference, it’s just worse inference. If we don’t do something like this, people (inexperienced data scientists or desperate product managers) are just going to compare before/after, or compare people who use the feature and people who don’t. That’s just worse causal inference.
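Here’s what that first-stage check might look like, continuing with the simulated df from the sketch above (so, again, toy data and made-up column names). With a single binary instrument, the F-statistic is just the square of the t-statistic, and a common rule of thumb is to want it well above 10 before trusting the IV estimate.

```python
import statsmodels.api as sm

# First-stage check: does the (randomized) popup actually move feature usage?
first_stage = sm.OLS(df["used_feature"], sm.add_constant(df["saw_popup"])).fit()
print(first_stage.params["saw_popup"])           # estimated lift in usage from seeing the popup
f_stat = first_stage.tvalues["saw_popup"] ** 2   # with one instrument, F = t squared
print(f"first-stage F-statistic: {f_stat:.0f}")  # rule of thumb: want this well above 10
```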
Some of the most important questions we try to answer are causal in nature. Causal inference is hard, perhaps impossible to do perfectly. But there are better and worse ways of doing it, and even if we can’t do it perfectly, we should still do it as best we can, and we should keep looking for better ways to do it. It’s that important.
P.S. I am a little embarrassed when I look back on that experience. I could do so much more, so much better today. But I guess that’s what growth looks like.