Focus on Iteration Speed
Introduction
I make absolutely no attempt to create a great data science model when starting on a new project. I know I’m going to have to iterate a lot before the project is a success, so I’m much more focused on positioning myself to iterate quickly and easily than on creating a great model right out of the gate.
This approach is inspired by the “OODA loop” developed by USAF Colonel John Boyd. “OODA” stands for Observe-Orient-Decide-Act and refers to a typical iteration a fighter pilot works through in the field. The pilot Observes their surroundings, Orients themselves to figure out what those conditions mean to them, Decides on a course of action, and then Acts, repeating this core loop over and over. Boyd’s central insight was that the faster this OODA loop, the more successful the pilot will be. If the pilot’s OODA loop is meaningfully faster than their opponent’s, that’s a huge advantage.
I have found this concept to be essential when developing data science models. No one creates a great data science model on the first try. Model development is fundamentally an iterative process. The faster these iterations, the faster success will be achieved.
For a data scientist, a typical iteration starts with some exploratory data analysis that culminates in a hypothesis (I mean that in the general sense of the word, not necessarily the “statistical hypothesis testing” sense) about the phenomenon under study, then translates that hypothesis into a modeling strategy and writes code to implement that strategy. Assessing performance turns into another round of exploratory data analysis, and the cycle begins again. I’ll call this the EHMI loop (Explore, Hypothesize, Model, Implement) for the purposes of this article, but when I talk about this topic with my co-workers, I typically still call it an OODA loop.
Iteration speed does more than determine the time to success; it also determines whether this process is fun or soul-sucking. When the EHMI loop is fast and frictionless, developing data science models is fun. When the process is slow, we end up twiddling our thumbs while we wait for data to load or models to fit, which kills our momentum and prevents a state of flow. When our tools are clumsy or our code is messy, data exploration is error-prone and our changes are likely to introduce distracting bugs. We end up focusing on the tool or the code rather than the goal, context switching every iteration and overloading ourselves cognitively.
Over time, I have developed habits to streamline each aspect of the EHMI loop. These tips may be helpful in your work, but my most important guidance is to identify the recurrent pain points slowing down your iteration time and eliminate them.
You need to be able to run your code really easily.
This seems obvious, but I often see data scientists struggle to run their code end to end. I think the flexibility of notebooks actually interferes with this: people lose track of which order the cells need to be run in and just get themselves in a mess. So I do all my development “at the command line”. Actually, I do all my development directly in Emacs, and I run my code as a test case, which I can launch directly from within Emacs. My code typically looks like:
import datetime
import json
import logging
import os
import subprocess


def test_integration():
    argv = []
    main(argv)


def main(argv=None):
    config = load_config(argv)
    configure_logging(config)
    data = load_data(config)
    model = fit_model(data, config)
    do_something_with(model)


def load_config(argv=None):
    raise NotImplementedError("To Be Discussed")


def configure_logging(config):
    raise NotImplementedError("To Be Discussed")


def load_data(config):
    raise NotImplementedError("To Be Discussed")


def fit_model(data, config):
    raise NotImplementedError("To Be Discussed")


def do_something_with(model):
    raise NotImplementedError("To Be Discussed")


if __name__ == "__main__":
    main()
The very first function is a test case I can run with pytest. Within elpy, I just hit C-c C-t. The argv bit will make more sense in a moment; but the main function loads config details, configures logging, loads data, fits a model, and does something with the model. I write basically this same script over and over again, so I use skeletor to save this as a template. Skeletor makes it really simple to create a new project, install the needed python packages in a virtual environment, and initialize a git repository. One of these days I’ll write more details about how I use Emacs for python development, but for now let’s stay focused on how I speed up my EHMI loop.
You need to be able to run the code in different ways really easily.
I use a combination of command line args and hard-coded config values for this.
My load_config function looks something like:
def load_config(argv=None):
    import argparse

    parser = argparse.ArgumentParser(description="OODA")
    parser.add_argument(
        "--description",
        help="Brief description for log and model files",
    )
    args = parser.parse_args(argv)
    config = {
        "description": args.description,
        # Command line args or defaults
        # and hard-coded config values
    }
    return config
Since the load_config function takes argv as an argument and passes it to parser.parse_args, I’m able to specify the command line options in my test case. So my test case might look more like:
def test_integration():
    argv = [
        "--description",
        "early_development",
    ]
    main(argv)
Again, this makes it really easy to quickly run the code in different ways. The command line inputs I use vary based on the task at hand, but often I have flags for whether to load data from a SQL query or from a file (see below), whether to load a saved model, whether to perform feature selection when training a model, etc.
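As a rough sketch of what that can look like (these particular flags are illustrative, not the exact options I use), the parser might grow to something like:

def load_config(argv=None):
    import argparse

    parser = argparse.ArgumentParser(description="OODA")
    parser.add_argument(
        "--description",
        help="Brief description for log and model files",
    )
    # Illustrative flags; the real set varies by project.
    parser.add_argument(
        "--from-file",
        action="store_true",
        help="Load the cached dataset instead of re-running the SQL query",
    )
    parser.add_argument(
        "--load-model",
        help="Path to a previously saved model to load instead of fitting",
    )
    parser.add_argument(
        "--feature-selection",
        action="store_true",
        help="Perform feature selection when training the model",
    )
    args = parser.parse_args(argv)
    config = {
        "description": args.description,
        "from_file": args.from_file,
        "load_model": args.load_model,
        "feature_selection": args.feature_selection,
        # plus hard-coded config values
    }
    return config

Each flag becomes an entry in config, so the rest of the pipeline only ever has to look at config.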
You need good logging.
You need to be able to look back at an earlier iteration of your code and see what the results were. This makes it easy to quickly compare model performance across iterations. Less obviously, sometimes I get some weird result that I don’t notice at first. Once I do notice it, I need to figure out when the weirdness started, and being able to look at older logs saves me a lot of head-banging.
I like to log the latest git commit SHA and any changes in my working directory. That way I can figure out exactly what code produced what result. I also like to include an optional short description in the filename because I end up having a lot of these log files and it can be hard to find the one I’m looking for.
def configure_logging(config):
    log_dir_name = os.path.join(
        os.path.dirname(__file__),
        "..",
        "logs",
    )
    if not os.path.exists(log_dir_name):
        os.makedirs(log_dir_name)
    ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    if config.get("description", None) is None:
        description = ""
    else:
        description = "_" + config["description"]
    log_fn = os.path.join(
        log_dir_name,
        f"{ts}{description}.log",
    )
    handlers = [
        logging.FileHandler(log_fn),
        logging.StreamHandler(),
    ]
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=handlers,
    )
    sha = subprocess.check_output(
        ["git", "rev-parse", "HEAD"]
    ).decode("utf-8")
    logging.info(f"Commit SHA of HEAD: {sha}")
    diff = subprocess.check_output(
        ["git", "diff"]
    ).decode("utf-8").strip()
    logging.info("Uncommitted changes:\n" + diff)
    logging.info(
        "Configuration:\n"
        + json.dumps(config, indent=2)
    )
The bit about handlers logs both to standard error (as it normally would) and to the log file. When running under pytest, standard error is only printed when there is an error, so my test case really looks like:
def test_integration(capsys):
    argv = [...]
    with capsys.disabled():
        main(argv)
That way I see the results immediately, and I also have them in a log file for later perusal.
You need to be able to load data quickly.
My data source is typically some complicated SQL query. I like to do as much data cleaning and aggregating in SQL as I can, since I can write correct SQL much faster than I can write correct Pandas. (Despite having used Pandas at least weekly for the last 5 years, I still find the syntax to be hard to remember. I find SQL to be simple and obvious, and whatever servers are running my query are typically more powerful than my laptop.)
But I also certainly do not want to run the same query twice. So I’ll save my dataset as a CSV or (less frequently) as a Parquet file. I’ll also typically create a much smaller sampled dataset consisting of, say, a random 1% of the original dataset. Parquet files can be much faster than CSV for loading tabular data, and a sampled dataset is more than adequate for early exploration and development. Remember: I don’t care at all about model performance early on. And it feels great to see the improved performance when I eventually switch to the full dataset.
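A load_data function along these lines might look like the sketch below. The file paths, the use_sample flag, and the run_query helper are placeholders rather than my exact setup, and it assumes the module-level imports from the template above (os, logging) plus pandas with a Parquet engine installed.

def load_data(config):
    import pandas as pd  # assumed dependency (plus pyarrow for Parquet)

    # Hypothetical file locations; adjust per project.
    full_path = os.path.join("data", "dataset.parquet")
    sample_path = os.path.join("data", "dataset_sample.parquet")

    if not os.path.exists(full_path):
        data = run_query(config)
        data.to_parquet(full_path)
        # Save a small random sample for fast early iterations.
        data.sample(frac=0.01, random_state=0).to_parquet(sample_path)

    path = sample_path if config.get("use_sample") else full_path
    logging.info(f"Loading data from {path}")
    return pd.read_parquet(path)


def run_query(config):
    # Stand-in for whatever executes the complicated SQL query and
    # returns a DataFrame (e.g. pd.read_sql against your warehouse).
    raise NotImplementedError("To Be Discussed")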
Notebook enthusiasts are wondering right now if I’m crazy, because they load the data into memory once and never have to worry about it again. But when I use notebooks (and to be clear, I do often use notebooks, just not for anything nontrivial), I sometimes need to restart the kernel for whatever reason, which clears the memory. So even when using a notebook, I still save my dataset and reload it from file.
You should start with the simplest possible model.
For me that’s typically either linear regression or logistic regression in statsmodels (I never use scikit-learn for this sort of thing). I typically don’t do any kind of feature selection initially, though I typically will implement a weird hybrid of forward selection and cross validation eventually. I hope to write about my approach to feature selection in a future article!
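To make that concrete, a first pass at fit_model is often little more than a formula-based call to statsmodels. The outcome and feature names here are placeholders, and the logging call assumes the module-level import from the template above:

import statsmodels.formula.api as smf


def fit_model(data, config):
    # Placeholder formula; outcome and feature names depend on the project.
    model = smf.logit("outcome ~ feature_1 + feature_2", data=data).fit()
    logging.info(model.summary().as_text())
    return model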
A simple model fits fast and supports a quick sanity check on my data inputs. This is even faster than visualizing my input data, which is also a good early step, but I typically don’t do any data visualization early on because I find matplotlib and even seaborn to be too inconvenient. I typically only reach for visualization when I’m trying to debug something I think is going wrong with the dataset.
Start quantifying model performance asap.
The sooner you start tracking your model performance, the sooner you can start improving it. I typically use either mean-squared error or cross entropy loss. I also really like the idea of “explained variance”: we calculate the error of a model that makes the same prediction for all observations (i.e. the overall average outcome), and then report the percent reduction in error relative to this trivial baseline. The simple model from the previous section will often be an impressive improvement over this baseline; the final model will typically be a humblingly small additional improvement.
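As a concrete sketch of that calculation for mean-squared error (for cross entropy, the trivial baseline prediction would be the overall outcome rate instead):

import numpy as np


def explained_variance(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Error of the trivial model that always predicts the overall mean.
    baseline_mse = np.mean((y_true - y_true.mean()) ** 2)
    model_mse = np.mean((y_true - y_pred) ** 2)
    # Fraction of the baseline error eliminated by the model.
    return 1.0 - model_mse / baseline_mse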
Clean up your code.
Messy code is harder to debug and modify. Tools like black and flake8 make code cleanliness automatic. I use both, but for different reasons.
I use black to reformat code. I run it frequently, and one of these days I will configure Emacs to run it automatically when saving a file. (I don’t have a good reason for not doing so, other than running it manually is like adding an egg to boxed cake mix: it makes me feel like I’m doing something.) I use flake8 to highlight unused variables, overly complicated functions, and other small errors I’m nevertheless grateful to have pointed out to me.
This one might seem counter to my “speed up your iterations” theme, but just like with your apartment, it’s easier to clean as you go than it is to wait for everything to be a mess before refactoring. I also find this to be a good filler task when I’m stuck on something: my subconscious keeps working on the problem and by the time I’m done refactoring, I’m often unstuck.
Master your IDE.
Whether you are using Jupyter notebooks, PyCharm, or Emacs: learn how to use your tools. People sometimes try to debate me on why VS Code is better than Emacs, but then those same people don’t know how to run black in VS Code. (You can; my point is that it’s weird when I have to teach the VS Code zealots how to use VS Code.)
Being able to:
- Navigate code
- Look up documentation
- Format/lint
- Run test cases
- Use version control,
ideally without your hands leaving the keyboard, will help get you in that state of flow and help you iterate faster. This is all possible in Emacs, and I expect the same is possible in any serious IDE (PyCharm, VS Code, maybe even Jupyter). So I’m not saying you need to use Emacs, I’m saying you should learn how to use your IDE.
Summary
Getting to state-of-the-art (or even “good enough”) model performance requires iteration. This process can be fun and efficient, or it can be laborious and soul-sucking. The difference is how fast your iterations are. A typical iteration (or EHMI loop) involves Exploring the data, developing a Hypothesis, translating that hypothesis into a Model, and Implementing it. As you work through this cycle, think about the pain points and eliminate anything that is slowing you down or preventing you from getting into a state of flow.
Ask yourself:
- How long does it take to run the code?
- How long does it take to load and manipulate the dataset?
- How long does it take to fit the model?
- Can I easily track model performance across iterations?
- How much time does it take to modify my code? How much time am I spending debugging my code?
- How much time am I spending fighting the IDE or other tools?
Developing good habits and leveraging code templates like the ones I have shared above means that streamlining development doesn’t take any extra time or effort, and it can dramatically shorten the time to success.