Posts
Design-Based Inference and Sensitivity Analysis for Survey Sampling
In this note, we consider sampling from a finite population, without replacement and with unequal probabilities. We seek an estimate of the population mean of some characteristic.
Principal Stratification and Mediation
This post explores principal stratification and mediation analysis as tools for understanding causal effects, decomposing them into direct and indirect components. It covers scenarios like non-compliance, missing outcomes, and surrogate indices, highlighting the importance of assumptions such as no direct effects and no Defiers. Practical methods, including multiple imputation, regression, and matching, are discussed for estimating effects even when key quantities are unobserved. Real-world examples, like marketing lift studies and product funnels, illustrate the relevance of these techniques for addressing complex causal questions.
Interpretable and Validatable Uplift Modeling
In this note, we introduce a method for interpreting and validating the results of uplift modeling. We propose two novel strategies for controlling the Familywise Error Rate in this setting.
Modes of Inference in Randomized Experiments
Randomization provides the “reasoned basis for inference” in an experiment. Yet some approaches to analyzing experiments ignore the special structure of randomization. Simple, familiar approaches like regression models sometimes give wrong answers when applied to experiments. Approaches exploiting randomization deliver more reliable inferences than methods neglecting it. Randomization inference should be the first method we reach for when analyzing experiments.
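To make this concrete, here is a minimal sketch of a randomization test for a difference in means; the outcomes and permutation count below are invented for illustration. Under the sharp null of no effect, every re-randomization of the pooled outcomes is equally likely, which is all the test needs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented outcomes from a small two-arm experiment.
treated = np.array([5.2, 6.1, 4.8, 7.0, 5.5])
control = np.array([4.9, 5.0, 4.4, 5.8, 4.7])

observed = treated.mean() - control.mean()
pooled = np.concatenate([treated, control])
n_treated = len(treated)

# Re-randomize: under the sharp null, any split of the pooled
# outcomes into "treated" and "control" was equally likely.
n_perms = 10_000
count = 0
for _ in range(n_perms):
    perm = rng.permutation(pooled)
    diff = perm[:n_treated].mean() - perm[n_treated:].mean()
    count += diff >= observed

print(f"one-sided randomization p-value: {count / n_perms:.3f}")
```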
Sensitivity Analysis for Matched Sets with One Treated Unit
Adjusting for observed factors does not elevate an observational study to the reliability of an experiment. P-values are not appropriate measures of the strength of evidence in an observational study. Instead, sensitivity analysis allows us to identify the magnitude of hidden biases that would be necessary to invalidate study conclusions. This leads to a strength-of-evidence metric appropriate for an observational study.
Sensitivity Analysis for Matched Pairs
Observational studies involve more uncertainty than randomized experiments. Sensitivity analysis offers an approach to quantifying this uncertainty.
Attributable Effects
In a previous post, we discussed why randomization provides a reasoned basis for inference in an experiment. Randomization not only quantifies the plausibility of a causal effect but also allows us to infer something about the size of that effect.
The Reasoned Basis for Inference in Experiments
In his 1935 book, “The Design of Experiments”, Ronald Fisher described randomization as the “reasoned basis for inference” in an experiment. Why do we need a “basis” at all, let alone a reasoned one?
Tests with One-Sided Noncompliance
Tech companies spoil data scientists. It’s so easy for us to A/B test everything. We can alter many aspects of the product from a configuration UI. We have the sample size to get a good read in as little as a few days. We have the data infrastructure to analyze and report results quickly.
Eglot+Tree-Sitter in Emacs 29
I’ve been an Emacs user for about 15 years, and for the most part I use Emacs for org-mode and python development. I’ve happily used Jorgen Schäfer’s elpy as the core of my python development workflow for the last 5 years or so. Unfortunately the current maintainer, Gaby Launay, hasn’t had time to work on elpy for over a year now. In one sense this doesn’t matter: elpy is pretty stable; it’s open source so it can’t just disappear on me; and I feel comfortable making minor changes myself.
Compiling Emacs 29 With Tree-Sitter
I started a new job recently and took the opportunity to install a new version of Emacs. Emacs 29 includes tree-sitter and built-in eglot support, which I’ll write about some other time. In this post, I just want to document how I compiled Emacs on an M2 macOS device.
User Segmentation from Heterogeneous Treatment Effects
Imagine we are attempting to identify segments within an audience, perhaps so we can market to them more effectively through personalization. A common approach to doing so is to apply some kind of clustering algorithm (such as K-means) based on various user covariates. I have never been especially happy with this approach: the resulting clusters seem pretty arbitrary.
Heterogeneous Treatment Effect Estimation: Function Approximation
A simple approach to heterogeneous treatment effect estimation relies on the difference between approximations to the outcome function in the two treatment groups. In this post, I derive the conditions under which this approach works.
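As a rough illustration of the idea, here is a sketch on simulated data (the data-generating process is invented), fitting a separate linear approximation to the outcome in each group and taking the difference:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: the true treatment effect (0.5 + x) varies with x.
n = 2000
x = rng.uniform(-1, 1, n)
t = rng.integers(0, 2, n)
y = 1 + 2 * x + t * (0.5 + x) + rng.normal(0, 1, n)

# Approximate the outcome function separately in each group,
# then estimate the heterogeneous effect as the difference.
X = np.column_stack([np.ones(n), x])
beta_treat, *_ = np.linalg.lstsq(X[t == 1], y[t == 1], rcond=None)
beta_ctrl, *_ = np.linalg.lstsq(X[t == 0], y[t == 0], rcond=None)

grid = np.linspace(-1, 1, 5)
effects = np.column_stack([np.ones(5), grid]) @ (beta_treat - beta_ctrl)
print(dict(zip(grid.round(1), effects.round(2))))  # roughly 0.5 + x
```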
Thoughts on Models with Regularization
Lately I’ve been reflecting on regularization. Early in my data science career I spent some time working with generalized additive models, but I started focusing more and more on traditional statistical methods. I am rediscovering the value of regularization and expect to use more of it going forward.
The Winner's Curse: Why it Happens and What to Do About It
When running an A/B test with many variants, say, more than 5, we often run into a phenomenon known as the Winner’s Curse, where the winning variant performs worse once we adopt it universally than it did during the test itself. In this post, we discuss why this phenomenon occurs and what to do about it.
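A quick simulation (with made-up numbers) shows the mechanism: even when every variant has exactly the same true effect, the variant with the largest estimate looks like a winner during the test.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: 10 variants, all with the SAME true effect (zero),
# each measured with independent noise (standard error 1).
n_variants, n_sims = 10, 5_000
estimates = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n_variants))

# In each simulated test we "ship" the variant with the largest estimate.
winners = estimates.max(axis=1)

# The winner looks great during the test, but its true effect is zero.
print(f"average estimated lift of the winner: {winners.mean():.2f}")
print("true lift of every variant:           0.00")
```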
Reflections on 2021 and Interests Going Into 2022
As 2021 wrapped up, I’ve been reflecting on the past year and thinking about the next. I was similarly reflective this time last year, when I wrote about how 2020, for me, was the Year of Emacs.
Robust Portfolio Optimization in Models with Diminishing Returns
In our last post, we discussed how model uncertainty poses a risk when allocating resources among productive assets. In this post, we expand the discussion to models with diminishing returns. Such models are common in economics. As before, we can incorporate model uncertainty directly into the problem, achieving good performance regardless of the true model, with minimal impact on nominal performance. Robust optimization is both powerful and practical.
Robust Portfolio Optimization in Generalized Linear Models
Often we run an A/B test in order to inform some decision. But every A/B test involves uncertainty, no matter the sample size. This uncertainty poses a risk to our decision, which can be hedged by a process analogous to diversifying an investment portfolio. Finding a robust-optimal portfolio is both practical and fast.
Focus on Iteration Speed
The OODA loop (Observe-Orient-Decide-Act) framework was developed by USAF Colonel John Boyd to improve fighter pilot performance in the field. We can apply a similar framework to improving the efficiency with which we develop data science models. The key insight is to embrace the iterative nature of model development and streamline each component of these iterations.
The Alternative to Causal Inference is Worse Causal Inference
Some of the most important questions data scientists investigate are causal questions. They’re also some of the hardest to answer! A well-designed A/B test often provides the cleanest answer, but when a test is infeasible, there are plenty of other causal inference techniques that may be useful. While not perfect, these techniques are much better than the alternative: ad hoc methods with no logical foundation.
Bayesian A/B Testing Considered Harmful
In science we study physically meaningful quantities that have some kind of objective reality, and that means that multiple people should draw substantively equivalent conclusions. But in some situations, this principle is at odds with the Bayesian Coherency Principle, and so we have to choose between internal consistency and consistency with external reality.
Edgeworth Series in Python
We often use distributions that can be reasonably approximated as Gaussian, typically due to the Central Limit Theorem. When the sample size is large (and the tails of the distribution are reasonable), the approximation is really good and there’s no point worrying about it. But with modest sample sizes, or if the underlying distribution is heavily skewed, the approximation may not be good enough.
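As a sketch of the kind of correction involved, here is a one-term Edgeworth adjustment to the normal CDF of a standardized sample mean; the skewness and sample size below are purely illustrative.

```python
import numpy as np
from scipy import stats

def edgeworth_cdf(x, n, skew):
    """One-term Edgeworth approximation to the CDF of the
    standardized mean of n iid draws with the given skewness."""
    correction = skew / (6 * np.sqrt(n)) * (x**2 - 1) * stats.norm.pdf(x)
    return stats.norm.cdf(x) - correction

# With a heavily skewed underlying distribution and a modest n,
# the correction is noticeable in the tail.
print(edgeworth_cdf(-1.96, n=30, skew=2.0))
print(stats.norm.cdf(-1.96))  # the plain normal approximation
```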
Testing with Many Variants
This is a long drive for someone with nothing to think about.
Robust Power Assessment
An important part of planning any statistical experiment is power analysis. In this post I will focus on power analysis for linear regression models, but I am hopeful much of this can be applied to Generalized Linear Models and hence to the sorts of A/B tests I normally run.
Scheffe's Method for Multiple Comparisons
I’ve written previously about using the Bonferroni correction for the multiple comparisons problem. While it is without a doubt the simplest way to correct for multiple comparisons, it is not the only way. In this post, I discuss Scheffé’s method for constructing simultaneous confidence intervals on arbitrarily many functions of the model parameters.
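As a minimal sketch (with illustrative numbers), Scheffé’s interval for any linear combination of the p model parameters inflates the usual critical value to √(p·F), which is what buys simultaneous coverage over all such combinations:

```python
import numpy as np
from scipy import stats

def scheffe_interval(estimate, se, p, df_resid, alpha=0.05):
    """Scheffé simultaneous CI for a linear combination of the p model
    parameters: valid for ALL such combinations at once."""
    crit = np.sqrt(p * stats.f.ppf(1 - alpha, p, df_resid))
    return estimate - crit * se, estimate + crit * se

# Illustrative: a contrast estimated at 2.0 with se 0.5, in a model
# with p = 3 parameters and 96 residual degrees of freedom.
print(scheffe_interval(estimate=2.0, se=0.5, p=3, df_resid=96))
```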
Supervised Learning as Function Approximation
Supervised learning is perhaps the most central idea in Machine Learning. It is equally central to statistics, where it is known as regression. Statistics formulates the problem in terms of identifying the distribution from which observations are drawn; Machine Learning, in terms of finding a model that fits the data well.
Naive Comparisons Under Endogeneity
Recently I have been reading Causal Inference: The Mixtape by Scott Cunningham. One thing I think Cunningham explains very well is the role of endogeneity in confounding even simple comparisons. I don’t have a background in economics, so I had never really grokked the concepts of endogenous and exogenous factors, especially as they relate to causal inference. In this post, I’m going to discuss a few examples that highlight why it’s such an important distinction.
Advice for Early Career Data Scientists
Coming out of college, I had some ideas about how I was going to become successful and what my career was going to look like. Of course, I was all wrong. Here is the advice I would offer a young me.
Multiple Comparisons
The simplest kind of A/B test compares two options, using a single KPI to decide which option is best. The more general theory of statistical experiment design easily handles more options and more metrics, provided we know how to account for the multiple comparisons involved. To see why this is important, read on!
Violations of the Stable Unit Treatment Value Assumption
We have previously mentioned the Stable Unit Treatment Value Assumption, or SUTVA, a complicated-sounding name for one of the most important assumptions underlying A/B testing (and Causal Inference in general). In this post, we talk a little more about it and why it is so important.
2020: My Year in Emacs
Other than the very fabric of society being torn apart, and other than the silver lining of getting to spend so much time with my wife and 2-year-old daughter, the big theme of 2020 for me personally was Emacs.
Statistics and Machine Learning: Better Together!
My master’s degree focused on Machine Learning, but when I got my first job as a data scientist, I quickly realized there was a lot I still needed to learn about Statistics. Since then I have come to appreciate the nuanced differences between Statistics and Machine Learning and I’m convinced they have a lot to offer one another!
Contingency Tables Part IV: The Score Test
The score test can be used to calculate p-values and confidence intervals for A/B tests. It considers the slope of the likelihood function at the parameter value associated with the null hypothesis.
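For a concrete (if simplified) example, here is the score test for a single binomial proportion; the counts below are invented for illustration. Note that the variance is evaluated at the null value, which is what distinguishes the score test from the Wald test.

```python
import numpy as np
from scipy import stats

def score_test_proportion(successes, n, p0):
    """Score test for H0: p = p0 in a binomial model."""
    p_hat = successes / n
    # Standard error uses the NULL value p0, not the estimate p_hat.
    z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
    return z, 2 * stats.norm.sf(abs(z))

z, p = score_test_proportion(successes=130, n=1000, p0=0.10)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```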
Thoughts on Principal Components Analysis
This is a post with more questions than answers. I’ve been thinking about Principal Components Analysis (PCA) lately.
Sprinkle some Maximum Likelihood Estimation on that Contingency Table!
Maximum Likelihood Estimation provides consistent estimators that can be computed efficiently under many null hypotheses of practical interest.
Contingency Tables Part II: The Binomial Distribution
In our last post, we introduced the potential outcomes framework as the foundational framework for causal inference. In the potential outcomes framework, each unit (e.g. each person) is represented by a pair of outcomes, corresponding to the result of the experience provided to them (treatment or control, A or B, etc.)
Contingency Tables Part I: The Potential Outcomes Framework
“Why can’t I take the results of an A/B test at face value? Who are you, the statistics mafia? I don’t need a PhD in statistics to know that one number is greater than another.” If this sounds familiar, it is helpful to remember that we do an A/B test to learn about different potential outcomes. Comparing potential outcomes is essential for smart decision making, and this framework is the cornerstone of causal inference.
Unshackle Yourself from Statistical Significance
Don’t be a prisoner to statistical significance. A/B testing should serve the business, not the other way around!
Commit Message Linting with Magit
I have a confession to make. I’ve been writing bad commit messages for years. It takes time to write good commit messages, and often I’m in a hurry. Or so I tell myself. But that’s a false dichotomy. I can have my cake and eat it too! Recently I discovered how to use magit to enforce best practices for commit messages.
Viterbi Algorithm, Part 2: Decoding
This is my second post describing the Viterbi algorithm. As before, our presentation follows Jurafsky and Martin closely, merely filling in some details omitted in the text.
Viterbi Algorithm, Part 1: Likelihood
The Viterbi algorithm is used to find the most likely sequence of states given a sequence of observations emitted by those states and some details of transition and emission probabilities. It has applications in Natural Language Processing like part-of-speech tagging, in error correction codes, and more!
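Here is a minimal sketch of the algorithm in Python, working in log space for numerical stability; the toy two-state model and its probabilities are invented for illustration.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely state sequence for a sequence of observations.
    start_p[s]: initial prob; trans_p[s, s2]: transition prob;
    emit_p[s, o]: emission prob."""
    n_states, T = len(start_p), len(obs)
    log_v = np.full((T, n_states), -np.inf)  # best log-prob per state
    back = np.zeros((T, n_states), dtype=int)  # backpointers
    log_v[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = log_v[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(scores)
            log_v[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(log_v[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy two-state model with made-up numbers.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], start, trans, emit))
```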
Minimum Edit Distance
Minimum Edit Distance is defined as the minimum number of edits (delete, insert, replace) needed to transform a source string to a target string. The algorithm uses dynamic programming both to calculate the minimum edit distance and to identify a corresponding sequence of edits.
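Here is a minimal sketch of the dynamic program, using unit costs for all three edits (some presentations, including Jurafsky and Martin’s, charge 2 for a replacement):

```python
def min_edit_distance(source, target):
    """Edit distance with unit costs for delete, insert, and replace."""
    m, n = len(source), len(target)
    # d[i][j] = distance between source[:i] and target[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete everything
    for j in range(n + 1):
        d[0][j] = j  # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + sub)   # replace (or match)
    return d[m][n]

print(min_edit_distance("intention", "execution"))  # 5 with unit costs
```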
Getting Things Done: Projects List and Next Actions
Lately I’ve been practicing David Allen’s “Getting Things Done” framework, which consists of components for getting tasks out of your head and into a system to improve productivity and reduce stress. I wrote about the overall system here. In this post, I want to talk about my Projects list and my Next Actions agenda.
Spinning up PostgreSQL in Docker for Easy Analysis
My typical analysis workflow is to start with data in some kind of database, perhaps Redshift or Snowflake. Often I’m working with millions or even billions of rows, but modern databases excel at operating with data at scale. Moreover, SQL is an intuitive and powerful tool for combining, filtering, and aggregating data. I’ll often do as much as I can in SQL, aggregate the data as much as I can, then export the data as a CSV to continue more advanced statistical calculations in python.
Timekeeping with Emacs and Org-Mode
Although I have been an Emacs user for 15 years, for the first 13 of those years I only used a handful of commands and one or two “modes”. A couple years ago I went through the Emacs tutorial (within Emacs, type C-h t) to see if I was missing anything useful. I was not disappointed! Since that time, I have gone through the entire Emacs manual, made full use of Elpy to create a rich Python IDE, adopted Magit to speed up my version control workflow, and more!
A/B Testing Best Practices
When I started this blog, my primary objective was less about teaching others A/B testing and more about clarifying my own thoughts on A/B testing. I had been running A/B tests for about a year, and I was starting to feel uncomfortable with some of the standard methodologies. It’s pretty common to use Student’s t-test to analyze A/B tests for example. One of the assumptions underlying that test is that the distributions are Gaussian. “What about A/B testing is Gaussian?”, I wondered. I knew there was a big difference between one-sided and two-sided tests, but I didn’t feel confident in my ability to choose the right one. And the multiple comparisons problem seemed to rear its ugly head at every turn: what was the best way to handle this?
Getting Things Done
Getting Things Done or GTD is a productivity framework introduced by David Allen. Since his book was first published in 2001, the paradigm has achieved something of a cult status, especially among Emacs users. In this post I will describe my very-much-in-progress implementation of these systems.
Object Detection with Deep Learning
One of the most interesting topics in the Coursera Deep Learning specialization is the “YOLO” algorithm for object detection. I often find it helpful to describe algorithms in my own words to solidify my understanding, and that is precisely what I will do here. Readers will likely prefer the original paper and its sequel.
Thoughts on the Coursera Deep Learning Specialization
I recently completed the Deep Learning specialization on Coursera from deeplearning.ai. Over five courses, they go over generic neural networks, regularization, convolutional neural nets, and recurrent neural nets. Having completed it, I would say the specialization is a great overview, and a jumping off point for learning more about particular techniques. I wouldn’t say I have an in-depth understanding of all the material, but I do feel like I could go off and read papers and understand them, which is maybe all I could expect.
Distribution of Local Minima in Deep Neural Networks
The “unreasonable effectiveness of deep learning” has been much discussed. Namely, as the cost function is non-convex, any optimization procedure will in general find a local, non-global, minimum. Actually, algorithms like gradient descent will terminate (perhaps because of early stopping) before even reaching a local minimum. For many experts in optimization, this seems like a bad thing. Concretely, it seems like the performance of networks trained in this way would be much worse than that of other optimization-based systems where we are in fact able to find the global minimum, such as logistic regression.
Computer Vision Cheat Sheet
I am currently working through Convolutional Neural Networks, the fourth course in the Coursera specialization on Deep Learning. The first week of that course contains some hard-to-remember equations about filter sizes, padding, and striding, and I thought it would be helpful to write them out for future reference.
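As a quick sanity check on those equations, here is the output-size formula in code; the example sizes are illustrative.

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial output size of a convolution on an n x n input with an
    f x f filter, padding p, and stride s: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# A "same" convolution: with f = 3 and p = 1, the size is preserved.
print(conv_output_size(n=32, f=3, p=1, s=1))  # 32
print(conv_output_size(n=32, f=3, p=0, s=2))  # 15
```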
Deep Learning Checklist
Recently I started the Deep Learning Specialization on Coursera. While I studied neural networks in my masters program (from Andrew Ng himself!), that was a long time ago and the field has changed considerably since then. I am supplementing the course by reading Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, which I will refer to as GBC16.
Repeatability
As businesses continue to invest in data-driven decision making, it becomes increasingly important to ensure the methods underlying those decisions are reliable. Unfortunately, we cannot take this for granted! Read on to learn a collection of best practices to make sure your decision making process rests on a sturdy foundation.
Optimal Experiment Design
We can plan sample sizes to control the width of confidence intervals.
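As a rough sketch for a two-arm comparison of means (assuming a known outcome standard deviation, which in practice we would estimate), we can invert the confidence-interval half-width formula, w = z·√(2σ²/n), for n:

```python
import numpy as np
from scipy import stats

def n_per_arm_for_ci_width(sigma, half_width, confidence=0.95):
    """Sample size per arm so that a two-sample CI for the difference
    in means has (approximately) the requested half-width."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return int(np.ceil(2 * (z * sigma / half_width) ** 2))

# Illustrative: outcome sd of 10, CI no wider than +/- 1.
print(n_per_arm_for_ci_width(sigma=10, half_width=1.0))  # 769
```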
Three Goals of Statistics: Description, Prediction, and Prescription
The great successes of Machine Learning in recent years are based on our ability to extrapolate and predict based on data. The next big step is learning and leveraging the relationship between cause and effect to prescribe what action to take.
Rotations, Orientations, and their Representations
Orientations pose an interesting challenge in polymorphism. Let’s implement a library in Rust!
Confidence Intervals
Statistical analysis is not complete without an estimate of residual uncertainty.
Rotational Axis Theorem (JIM)
The Rotational Axis Theorem allows us to decompose the dynamics of complicated systems into simpler components.
Statistical Power
Power considerations drive the sample sizes needed for a successful experiment.
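As a rough sketch, the standard two-sample z-test approximation gives a per-arm sample size of 2·((z₁₋α/₂ + z_power)·σ/δ)²; the effect size and standard deviation below are illustrative.

```python
import numpy as np
from scipy import stats

def n_per_arm(delta, sigma, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-sample z-test to
    detect a mean difference delta with the given power."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return int(np.ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2))

# Illustrative: detect a difference of 1 when the outcome sd is 10.
print(n_per_arm(delta=1.0, sigma=10))  # roughly 1570 per arm
```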
Counterfactuals and Causal Reasoning
What does ‘Why?’ mean anyway?
Fisher's Exact Test
Simulation-based inference sits on a rigorous foundation.
A/B Testing, Part 2: Statistical Significance
Results can’t always be taken at face value.
A/B Testing, Part 1: Random Segmentation
Random segmentation is the gold standard of Causal Inference.