Design-Based Inference and Sensitivity Analysis for Survey Sampling

· 27 min read · surveys

In this note, we consider sampling from a finite population, without replacement and with unequal probabilities. We seek an estimate of the population mean of some characteristic.

Principal Stratification and Mediation

· 38 min read · causal inference

This post explores principal stratification and mediation analysis as tools for understanding causal effects, decomposing them into direct and indirect components. It covers scenarios like non-compliance, missing outcomes, and surrogate indices, highlighting the importance of assumptions such as no direct effects and no Defiers. Practical methods, including multiple imputation, regression, and matching, are discussed for estimating effects even when key quantities are unobserved. Real-world examples, like marketing lift studies and product funnels, illustrate the relevance of these techniques for addressing complex causal questions.

Interpretable and Validatable Uplift Modeling

· 27 min read · causal inference

In this note, we introduce a method for interpreting and validating the results of uplift modeling. We propose two novel strategies for controlling the Familywise Error Rate in this setting.

Modes of Inference in Randomized Experiments

· 10 min read · causal inference

Randomization provides the “reasoned basis for inference” in an experiment. Yet some approaches to analyzing experiments ignore the special structure of randomization. Simple, familiar approaches like regression models sometimes give wrong answers when applied to experiments. Approaches exploiting randomization deliver more reliable inferences than methods neglecting it. Randomization inference should be the first method we reach for when analyzing experiments.
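As a minimal illustration of randomization inference (the outcome numbers here are made up), we compute a p-value by re-randomizing treatment labels and asking how often a difference in means at least as large as the observed one arises by chance:

```python
import random

# Hypothetical outcomes from a small two-arm experiment
treated = [12.1, 9.8, 11.4, 13.0, 10.9]
control = [9.5, 10.2, 8.8, 9.9, 10.1]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(treated) - mean(control)

# Re-randomize labels many times; the p-value is the share of
# re-randomizations with a statistic at least as extreme as observed.
pooled = treated + control
n_treated = len(treated)
rng = random.Random(0)
n_draws = 10_000
count = 0
for _ in range(n_draws):
    rng.shuffle(pooled)
    stat = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
    if stat >= observed:
        count += 1
p_value = count / n_draws
```

No distributional assumptions are needed: the randomization itself justifies the comparison.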

Sensitivity Analysis for Matched Sets with One Treated Unit

· 31 min read · causal inference

Adjusting for observed factors does not elevate an observational study to the reliability of an experiment. P-values are not appropriate measures of the strength of evidence in an observational study. Instead, sensitivity analysis allows us to identify the magnitude of hidden biases that would be necessary to invalidate study conclusions. This leads to a strength-of-evidence metric appropriate for an observational study.

Sensitivity Analysis for Matched Pairs

· 45 min read · causal inference

Observational studies involve more uncertainty than randomized experiments. Sensitivity analysis offers an approach to quantifying this uncertainty.

Attributable Effects

· 14 min read

In a previous post, we discussed why randomization provides a reasoned basis for inference in an experiment. Randomization not only quantifies the plausibility of a causal effect but also allows us to infer something about the size of that effect.

The Reasoned Basis for Inference in Experiments

· 11 min read

In his 1935 book, “Design of Experiments”, Ronald Fisher described randomization as the “reasoned basis for inference” in an experiment. Why do we need a “basis” at all, let alone a reasoned one?

Tests with One-Sided Noncompliance

· 14 min read

Tech companies spoil data scientists. It’s so easy for us to A/B test everything. We can alter many aspects of the product from a configuration UI. We have the sample size to get a good read in as little as a few days. We have the data infrastructure to analyze and report results quickly.

Eglot+Tree-Sitter in Emacs 29

· 11 min read

I’ve been an Emacs user for about 15 years, and for the most part I use Emacs for org-mode and python development. Jorgen Schäfer’s elpy has happily served as the core of my python development workflow for the last 5 years or so. Unfortunately the current maintainer, Gaby Launay, hasn’t had time to work on elpy for over a year now. In one sense this doesn’t matter: elpy is pretty stable; it’s open source so it can’t just disappear on me; and I feel comfortable making minor changes myself.

Compiling Emacs 29 With Tree-Sitter

· 2 min read

I started a new job recently and took the opportunity to install a new version of Emacs. Emacs 29 includes tree-sitter and built-in eglot support, which I’ll write about some other time. In this post, I just want to document how I compiled Emacs on an M2 macos device.

User Segmentation from Heterogeneous Treatment Effects

· 4 min read

Imagine we are attempting to identify segments within an audience, perhaps so we can market to them more effectively through personalization. A common approach to doing so is to apply some kind of clustering algorithm (such as K-means) based on various user covariates. I have never been especially happy with this approach: the resulting clusters seem pretty arbitrary.

Heterogeneous Treatment Effect Estimation: Function Approximation

· 15 min read

A simple approach to heterogeneous treatment effect estimation relies on a difference in approximations to the outcome function among the two treatment groups. In this post, I derive the conditions under which this approach works.

Thoughts on Models with Regularization

· 7 min read

Lately I’ve been reflecting on regularization. Early in my data science career I spent some time working with generalized additive models, but I started focusing more and more on traditional statistical methods. I am rediscovering the value of regularization and expect to use more of it going forward.

The Winner's Curse: Why it Happens and What to Do About It

· 14 min read

When running an A/B test with many variants, say, more than 5, we often run into a phenomenon known as the Winner’s Curse, where the winning variant performs worse when we adopt it universally than it had during the test itself. In this post, we discuss why this phenomenon occurs and what to do about it.
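A toy simulation makes the mechanism concrete (all numbers here are invented): even when every variant is truly identical, the estimated lift of the apparent winner is biased upward, because selecting the maximum also selects the luckiest noise draw.

```python
import random

rng = random.Random(42)
k = 10            # number of variants in the test
true_lift = 0.0   # all variants are actually identical
noise = 1.0       # sampling noise in each variant's estimate
n_sims = 2000

bias = 0.0
for _ in range(n_sims):
    estimates = [true_lift + rng.gauss(0, noise) for _ in range(k)]
    bias += max(estimates)  # we adopt the apparent winner
bias /= n_sims
# With 10 identical variants, the winner's average estimated lift
# sits well above its true lift of 0 -- the Winner's Curse.
```

The more variants we test, the larger this selection bias becomes.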

Reflections on 2021 and Interests Going Into 2022

· 4 min read

With 2021 wrapped up, I’ve been reflecting on the past year and thinking about the next. I was similarly reflective this time last year, when I wrote about how 2020, for me, was the Year of Emacs.

Robust Portfolio Optimization in Models with Diminishing Returns

· 11 min read

In our last post, we discussed how model uncertainty poses a risk when allocating resources among productive assets. In this post, we expand the discussion to models with diminishing returns. Such models are common in economics. As before, we can incorporate model uncertainty directly into the problem, achieving good performance regardless of the true model, with minimal impact on nominal performance. Robust optimization is both powerful and practical.

Robust Portfolio Optimization in Generalized Linear Models

· 8 min read

Often we run an A/B test in order to inform some decision. But every A/B test involves uncertainty, no matter the sample size. This uncertainty poses a risk to our decision, which can be hedged by a process analogous to diversifying an investment portfolio. Finding a robust-optimal portfolio is both practical and fast.

Focus on Iteration Speed

· 11 min read

The OODA loop (Observe-Orient-Decide-Act) framework was developed by USAF Colonel John Boyd to improve fighter pilot performance in the field. We can apply a similar framework to improving the efficiency with which we develop data science models. The key insight is to embrace the iterative nature of model development and streamline each component of these iterations.

The Alternative to Causal Inference is Worse Causal Inference

· 5 min read

Some of the most important questions data scientists investigate are causal questions. They’re also some of the hardest to answer! A well-designed A/B test often provides the cleanest answer, but when a test is infeasible, there are plenty of other causal inference techniques that may be useful. While not perfect, these techniques are much better than the alternative: ad hoc methods with no logical foundation.

Bayesian A/B Testing Considered Harmful

· 7 min read

In science we study physically meaningful quantities that have some kind of objective reality, and that means that multiple people should draw substantively equivalent conclusions. But in some situations, this principle is at odds with the Bayesian Coherency Principle, and so we have to choose between internal consistency and consistency with external reality.

Edgeworth Series in Python

· 4 min read

We often use distributions that can be reasonably approximated as Gaussian, typically due to the Central Limit Theorem. When the sample size is large (and the tails of the distribution are reasonable), the approximation is really good and there’s no point worrying about it. But with modest sample sizes, or if the underlying distribution is heavily skewed, the approximation may not be good enough.

Testing with Many Variants

· 10 min read

This is a long drive for someone with nothing to think about.

Robust Power Assessment

· 6 min read

An important part of planning any statistical experiment is power analysis. In this post I will focus on power analysis for linear regression models, but I am hopeful much of this can be applied to Generalized Linear Models and hence to the sorts of A/B tests I normally run.

Scheffe's Method for Multiple Comparisons

· 9 min read

I’ve written previously about using the Bonferroni correction for the multiple comparisons problem. While it is without a doubt the simplest way to correct for multiple comparisons, it is not the only way. In this post, I discuss Scheffé’s method for constructing simultaneous confidence intervals on arbitrarily many functions of the model parameters.

Supervised Learning as Function Approximation

· 4 min read

Supervised learning is perhaps the most central idea in Machine Learning. It is equally central to statistics, where it is known as regression. Statistics formulates the problem in terms of identifying the distribution from which observations are drawn; Machine Learning in terms of finding a model that fits the data well.

Naive Comparisons Under Endogeneity

· 9 min read

Recently I have been reading Causal Inference: The Mixtape by Scott Cunningham. One thing I think Cunningham explains very well is the role of endogeneity in confounding even simple comparisons. I don’t have a background in economics, so I had never really grokked the concepts of endogenous and exogenous factors, especially as it related to causal inference. In this post, I’m going to discuss a few examples that highlight why it’s such an important distinction.

Advice for Early Career Data Scientists

· 6 min read

Coming out of college, I had some ideas about how I was going to become successful and what my career was going to look like. Of course, I was all wrong. Here is the advice I would offer a young me.

Multiple Comparisons

· 6 min read

The simplest kind of A/B test compares two options, using a single KPI to decide which option is best. The more general theory of statistical experiment design easily handles more options and more metrics, provided we know how to incorporate the multiple comparisons involved. To see why this is important, read on!

Violations of the Stable Unit Treatment Value Assumption

· 4 min read

We have previously mentioned the Stable Unit Treatment Value Assumption, or SUTVA, a complicated-sounding term that is one of the most important assumptions underlying A/B testing (and Causal Inference in general). In this post, we talk a little more about it and why it is so important.

2020: My Year in Emacs

· 7 min read

Other than the very fabric of society being torn apart, and other than the silver lining of getting to spend so much time with my wife and 2 year old daughter, the big theme of 2020 for me personally was Emacs.

Statistics and Machine Learning: Better Together!

· 6 min read

My masters degree focused on Machine Learning, but when I got my first job as a data scientist, I quickly realized there was a lot I still needed to learn about Statistics. Since then I have come to appreciate the nuanced differences between Statistics and Machine Learning and I’m convinced they have a lot to offer one another!

Contingency Tables Part IV: The Score Test

· 11 min read

The score test can be used to calculate p-values and confidence intervals for A/B tests. The score test considers the slope of the likelihood function at the parameter value associated with the null hypothesis.

Thoughts on Principal Components Analysis

· 4 min read

This is a post with more questions than answers. I’ve been thinking about Principal Components Analysis (PCA) lately.

Sprinkle some Maximum Likelihood Estimation on that Contingency Table!

· 9 min read

Maximum Likelihood Estimation provides consistent estimators, and can be efficiently computed under many null hypotheses of practical interest.

Contingency Tables Part II: The Binomial Distribution

· 9 min read

In our last post, we introduced the potential outcomes framework as the foundational framework for causal inference. In the potential outcomes framework, each unit (e.g. each person) is represented by a pair of outcomes, corresponding to the result of the experience provided to them (treatment or control, A or B, etc.)

Contingency Tables Part I: The Potential Outcomes Framework

· 8 min read

“Why can’t I take the results of an A/B test at face value? Who are you, the statistics mafia? I don’t need a PhD in statistics to know that one number is greater than another.” If this sounds familiar, it is helpful to remember that we do an A/B test to learn about different potential outcomes. Comparing potential outcomes is essential for smart decision making, and this framework is the cornerstone of causal inference.

Unshackle Yourself from Statistical Significance

· 5 min read

Don’t be a prisoner to statistical significance. A/B testing should serve the business, not the other way around!

Commit Message Linting with Magit

· 6 min read

I have a confession to make. I’ve been writing bad commit messages for years. It takes time to write good commit messages, and often I’m in a hurry. Or so I tell myself. But that’s a false dichotomy. I can have my cake and eat it too! Recently I discovered how to use magit to enforce best practices for commit messages.

Viterbi Algorithm, Part 2: Decoding

· 10 min read

This is my second post describing the Viterbi algorithm. As before, our presentation follows Jurafsky and Martin closely, merely filling in some details omitted in the text.

Viterbi Algorithm, Part 1: Likelihood

· 7 min read

The Viterbi algorithm is used to find the most likely sequence of states given a sequence of observations emitted by those states and some details of transition and emission probabilities. It has applications in Natural Language Processing like part-of-speech tagging, in error correction codes, and more!
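For concreteness, here is a compact sketch of Viterbi decoding on a toy two-state weather HMM in the style of Jurafsky and Martin (the transition and emission probabilities below are invented for illustration):

```python
# Toy HMM: decode the most likely hidden state sequence for a
# sequence of observations (ice creams eaten: 1, 2, or 3).
states = ["Hot", "Cold"]
start = {"Hot": 0.6, "Cold": 0.4}
trans = {"Hot": {"Hot": 0.7, "Cold": 0.3},
         "Cold": {"Hot": 0.4, "Cold": 0.6}}
emit = {"Hot": {1: 0.2, 2: 0.4, 3: 0.4},
        "Cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs):
    # v[s] holds the best path probability ending in state s
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, v, ptr = v, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] * trans[p][s])
            ptr[s] = best
            v[s] = prev[best] * trans[best][s] * emit[s][o]
        back.append(ptr)
    # Trace the best final state back through the pointers
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

viterbi([3, 1, 3])  # most likely weather for this observation sequence
```

Dynamic programming keeps the work linear in the sequence length, rather than enumerating all state sequences.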

Minimum Edit Distance

· 7 min read

Minimum Edit Distance is defined as the minimum number of edits (delete, insert, replace) needed to transform a source string to a target string. The algorithm uses dynamic programming both to calculate the minimum edit distance and to identify a corresponding sequence of edits.
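The distance itself fits in a few lines of dynamic programming; here is a sketch with unit cost for each edit operation (some presentations charge 2 for a replace, which would change the numbers):

```python
def min_edit_distance(source, target):
    # dp[i][j]: minimum edits to turn source[:i] into target[:j]
    m, n = len(source), len(target)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # i deletions
    for j in range(n + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + sub)  # replace
    return dp[m][n]

min_edit_distance("intention", "execution")  # classic textbook pair
```

The corresponding edit sequence can be recovered by tracing back through the table from `dp[m][n]`.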

Getting Things Done: Projects List and Next Actions

· 5 min read

Lately I’ve been practicing David Allen’s “Getting Things Done” framework, which consists of components for getting tasks out of your head and into a system to improve productivity and reduce stress. I wrote about the overall system here. In this post, I want to talk about my Projects list and my Next Actions agenda.

Spinning up PostgreSQL in Docker for Easy Analysis

· 4 min read

My typical analysis workflow is to start with data in some kind of database, perhaps Redshift or Snowflake. Often I’m working with millions or even billions of rows, but modern databases excel at operating with data at scale. Moreover, SQL is an intuitive and powerful tool for combining, filtering, and aggregating data. I’ll often do as much as I can in SQL, aggregate the data as much as I can, then export the data as a CSV to continue more advanced statistical calculations in python.

Timekeeping with Emacs and Org-Mode

· 5 min read

Although I have been an Emacs user for 15 years, for the first 13 of those years I only used a handful of commands and one or two “modes”. A couple years ago I went through the Emacs tutorial (within Emacs, type C-h r) to see if I was missing anything useful. I was not disappointed! Since that time, I have gone through the entire Emacs manual, made full use of Elpy to create a rich Python IDE, adopted Magit to speed up my version control workflow, and more!

A/B Testing Best Practices

· 9 min read

When I started this blog, my primary objective was less about teaching others A/B testing and more about clarifying my own thoughts on A/B testing. I had been running A/B tests for about a year, and I was starting to feel uncomfortable with some of the standard methodologies. It’s pretty common to use Student’s t-test to analyze A/B tests for example. One of the assumptions underlying that test is that the distributions are Gaussian. “What about A/B testing is Gaussian?”, I wondered. I knew there was a big difference between one-sided and two-sided tests, but I didn’t feel confident in my ability to choose the right one. And the multiple comparisons problem seemed to rear its ugly head at every turn: what was the best way to handle this?

Getting Things Done

· 8 min read

Getting Things Done or GTD is a productivity framework introduced by David Allen. Since his book was first published in 2001, the paradigm has achieved something of a cult status, especially among Emacs users. In this post I will describe my very-much-in-progress implementation of these systems.

Object Detection with Deep Learning

· 7 min read

One of the most interesting topics in the Coursera Deep Learning specialization is the “YOLO” algorithm for object detection. I often find it helpful to describe algorithms in my own words to solidify my understanding, and that is precisely what I will do here. Readers likely will prefer the original paper and its sequel.

Thoughts on the Coursera Deep Learning Specialization

· 2 min read

I recently completed the Deep Learning specialization on Coursera from deeplearning.ai. Over five courses, they go over generic neural networks, regularization, convolutional neural nets, and recurrent neural nets. Having completed it, I would say the specialization is a great overview, and a jumping off point for learning more about particular techniques. I wouldn’t say I have an in-depth understanding of all the material, but I do feel like I could go off and read papers and understand them, which is maybe all I could expect.

Distribution of Local Minima in Deep Neural Networks

· 5 min read

The “unreasonable effectiveness of deep learning” has been much discussed. Namely, as the cost function is non-convex, any optimization procedure will in general find a local, non-global, minimum. Actually, algorithms like gradient descent will terminate (perhaps because of early stopping) before even reaching a local minimum. For many experts in optimization, this seems like a bad thing. Concretely, it seems like the performance of networks trained in this way would be much worse than other optimization-based systems where we are in fact able to find the global minimum, such as logistic regression.

Computer Vision Cheat Sheet

· 2 min read

I am currently working through Convolutional Neural Networks, the fourth course in the Coursera specialization on Deep Learning. The first week of that course contains some hard-to-remember equations about filter sizes and padding and striding and I thought it would be helpful for me to write it out for future reference.

Deep Learning Checklist

· 8 min read

Recently I started the Deep Learning Specialization on Coursera. While I studied neural networks in my masters program (from Andrew Ng himself!), that was a long time ago and the field has changed considerably since then. I am supplementing the course by reading Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, which I will refer to as GBC16.

Repeatability

· 3 min read

As businesses continue to invest in data-driven decision making, it becomes increasingly important to ensure the methods underlying those decisions are reliable. Unfortunately, we cannot take this for granted! Read on to learn a collection of best practices to make sure your decision making process rests on a sturdy foundation.

Optimal Experiment Design

· 3 min read

We can plan sample sizes to control the width of confidence intervals.
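A minimal sketch of the idea, assuming a known outcome standard deviation and a normal-approximation interval: solve the half-width formula z·σ/√n ≤ h for n.

```python
import math

def sample_size_for_ci(sigma, half_width, z=1.96):
    # Solve z * sigma / sqrt(n) <= half_width for n,
    # rounding up to the next whole unit.
    return math.ceil((z * sigma / half_width) ** 2)

# e.g. sd of 10, want the 95% CI to be +/- 1 around the mean
sample_size_for_ci(10, 1)
```

Halving the desired width quadruples the required sample size, so precision targets should be chosen with care.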

Three Goals of Statistics: Description, Prediction, and Prescription

· 4 min read

The great successes of Machine Learning in recent years are based on our ability to extrapolate and predict based on data. The next big step is learning and leveraging the relationship between cause and effect to prescribe what action to take.

Rotations, Orientations, and their Representations

· 16 min read

Orientations pose an interesting challenge in polymorphism. Let’s implement a library in Rust!

Confidence Intervals

· 11 min read

Statistical analysis is not complete without an estimate of residual uncertainty.

Rotational Axis Theorem (JIM)

· 10 min read

The Rotational Axis Theorem allows us to decompose the dynamics of complicated systems into simpler components.

Statistical Power

· 13 min read

Power considerations drive the sample sizes needed for a successful experiment.

Counterfactuals and Causal Reasoning

· 8 min read

What does ‘Why?’ mean anyway?

Fisher's Exact Test

· 7 min read

Simulation-based inference sits on a rigorous foundation.

A/B Testing, Part 2: Statistical Significance

· 11 min read

Results can’t always be taken at face value.

A/B Testing, Part 1: Random Segmentation

· 14 min read

Random segmentation is the gold standard of Causal Inference.
