Supervised Learning as Function Approximation
Recently I’ve been reading, “Mostly Harmless Econometrics” by Angrist and Pischke. Like any good book, it is making me re-examine things I thought I understood for years in a new light.
Supervised learning is perhaps the most central idea in Machine Learning. It is equally central to statistics where it is known as regression. Statistics formulates the problem in terms of identifying the distribution from which observations are drawn; Machine Learning in terms of finding a model that fits the data well.
As a statistician I sometimes wonder, where do the probabilities come from? (I don’t believe randomness exists outside of quantum mechanics—probability is just a convenient mathematical fiction.) As a student of Machine Learning, we might as well ask, why doesn’t the model fit the data perfectly? While I’m familiar with the standard answers to both questions, Mostly Harmless Econometrics gave me some ideas about an alternative approach.
Suppose we want to understand what causes some phenomenon, which could be a person’s decision to buy something, the amount of fruit a plum tree produces, or anything. We assume that the outcome $y$ is a deterministic function of observed factors $x$ and potentially unobserved factors $z$, that is, $y = f(x, z)$.
Since we do not observe $z$, we would like to identify an approximation to $f$ that is a function of $x$ alone. Define $$ g^\ast(x) = \underset{g \in \mathcal{G}}{\operatorname{arg\,min}} \int_\mathcal{X} \int_\mathcal{Z} D(g(x), f(x, z)) \cdot w(x, z) \, dz \, dx, $$ where $\mathcal{X} \times \mathcal{Z}$ is the domain of $f$, $D$ is some distance or loss function, $w$ is some non-negative weighting function satisfying $\int_\mathcal{X} \int_\mathcal{Z} w(x, z) \, dz \, dx = 1$, and $\mathcal{G}$ is some family of functions with support on $\mathcal{X}$.
Note that $w(x, z)$ can be interpreted as a probability density function corresponding to a distribution we’ll call $\mathcal{F}$. Suppose we observe $n$ samples derived from this distribution: $\left\{ \left( x^{(i)}, \, f\left(x^{(i)}, z^{(i)}\right) \right) \right\}_{i=1}^n$. The notation is intended to make it clear that we do not directly observe $z^{(i)}$, but we do observe the resulting $y^{(i)} = f\left(x^{(i)}, z^{(i)}\right)$. Note that there is no error term in this model: $y^{(i)}$ is determined completely and exactly by $x^{(i)}$ and $z^{(i)}$—we just don’t observe $z^{(i)}$.
Define $$ \hat{g}_n(x) = \underset{g \in \mathcal{G}}{\operatorname{arg\,min}} \frac{1}{n} \sum_{i=1}^n D\left(g\left(x^{(i)}\right), \, y^{(i)}\right). $$ This should be very familiar to statisticians and Machine Learning researchers alike: $\hat{g}_n(x)$ is the sort of model that we fit to data $\left\{ \left( x^{(i)}, \, y^{(i)} \right) \right\}$ based on some loss function $D$ such as squared error or cross entropy loss. We will show that $\lim_{n \to \infty} \hat{g}_n(x) = g^\ast(x)$.
For any particular observation, $$ E \left[ D\left(g\left(x^{(i)}\right), \, y^{(i)}\right) \right] = \int_\mathcal{X} \int_\mathcal{Z} D(g(x), f(x, z)) \cdot w(x, z) \, dz \, dx, $$ so by the strong law of large numbers, $$ \frac{1}{n} \sum_{i=1}^n D\left(g\left(x^{(i)}\right), \, y^{(i)}\right) \overset{\textrm{a.s.}}{\longrightarrow} \int_\mathcal{X} \int_\mathcal{Z} D(g(x), f(x, z)) \cdot w(x, z) \, dz \, dx, $$ where a.s. denotes almost sure convergence. Since in the limit as $n \to \infty$, the objectives leading to $g^\ast$ and $\hat{g}_n$ are the same, $\hat{g}_n \overset{\textrm{a.s.}}{\longrightarrow} g^\ast$.
This shows one interpretation of what we are doing in supervised learning: finding an approximation to a function $f$ based on a subset of the information determining $f$.
This explains where the probabilities come from: they’re just a way of connecting a dataset to the integrals we would need to evaluate. It also explains three reasons the resulting model may not fit the data perfectly. First, we may not observe all of the relevant factors determining $y$. Second, we may be limiting ourselves by adhering to a family of functions $\mathcal{G}$, if that family does not resemble $f$. For example, if $\mathcal{G}$ is the space of linear functions, and $f$ has important nonlinearities, $g^\ast$ and $\hat{g}_n$ will be poor approximations to $f$. Third, with a finite dataset, the sample average of the loss function may not closely approximate the integrals.
In 2021, it seems like we can always get more data, and by using a suitably large family $\mathcal{G}$, such as the space of deep neural networks, the second and third reasons can be mitigated. But if there are important unobserved factors, no amount of data, and no amount of deep learning, can achieve perfect performance.