Object Detection with Deep Learning
One of the most interesting topics in the Coursera Deep Learning specialization is the “YOLO” algorithm for object detection. I often find it helpful to describe algorithms in my own words to solidify my understanding, and that is precisely what I will do here. Readers likely will prefer the original paper and its sequel.
I’ll focus on the “v2” version of the algorithm, which attempts to draw a bounding box around each object in an image and assign it a class label. For example, in the context of autonomous driving, it might draw a bounding box around a traffic sign and label it as a stop sign.
Training data
The training data consist of images, some of which contain objects while others contain only background. Each image is divided into $S \times S$ grid cells. This is done to accommodate images containing multiple objects. By dividing the image into small enough regions, each cell will hopefully contain at most one object. While an object may span multiple cells, only the cell containing the midpoint of the object receives credit for it. We’ll use $S=13$ as our example. The authors of the YOLO paper recommend using odd $S$ so that the center of the image lies in the center of a cell rather than at the corner of multiple cells, since objects are often at the center of an image.
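To make the cell assignment concrete, here is a minimal NumPy sketch (my own, not from the paper) that maps an object midpoint, given as fractions of the image width and height, to its responsible grid cell; the function name and coordinate convention are mine.

```python
import numpy as np

def responsible_cell(x_mid, y_mid, S=13):
    """Map an object midpoint, given as fractions of the image width and
    height, to the (row, col) of the grid cell responsible for it."""
    col = int(np.floor(x_mid * S))
    row = int(np.floor(y_mid * S))
    # Clamp in case the midpoint sits exactly on the right or bottom edge.
    return min(row, S - 1), min(col, S - 1)

# A midpoint 40% of the way across and 70% of the way down the image
# falls in row 9, column 5 of a 13-by-13 grid.
print(responsible_cell(0.40, 0.70))  # (9, 5)
```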
To make this even more robust, the algorithm uses “anchor boxes”. The idea is that even if there are two object midpoints in the same cell, likely the objects will have different aspect ratios. For example, a pedestrian is tall and skinny while a car is short and wide. The anchor boxes are chosen by applying $k$-means clustering to the bounding boxes in the training set, with $k=5$. Whichever anchor box has the most overlap with the object is responsible for detecting it.
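To make the “most overlap” rule concrete, here is a small sketch (my own illustration, with hypothetical anchor shapes) that picks the responsible anchor box by comparing the ground-truth box to each anchor using the IoU of their widths and heights alone, i.e. as if the two boxes shared the same center:

```python
import numpy as np

def shape_iou(wh, anchor_wh):
    """IoU of two boxes that share the same center, so only their
    widths and heights matter."""
    inter = min(wh[0], anchor_wh[0]) * min(wh[1], anchor_wh[1])
    union = wh[0] * wh[1] + anchor_wh[0] * anchor_wh[1] - inter
    return inter / union

# Five hypothetical anchor shapes (width, height), e.g. from k-means.
anchors = [(0.05, 0.08), (0.12, 0.10), (0.10, 0.25), (0.30, 0.15), (0.40, 0.45)]

gt_wh = (0.15, 0.17)  # the ground-truth box's width and height
ious = [shape_iou(gt_wh, a) for a in anchors]
best = int(np.argmax(ious))
print(f"anchor {best} is responsible (IoU = {ious[best]:.2f})")
```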
The label for each cell and anchor box is a vector
$$ v = \left[ \begin{array}{cccccc} p & b_x & b_y & b_w & b_h & \vec{c} \end{array} \right]^T, $$ where $p$ reflects both the probability that the cell contains an object and how well the predicted bounding box overlaps it; $b_x$ and $b_y$ correspond to the $x$ and $y$ coordinates of the midpoint of the object; $b_w$ and $b_h$ are the width and height of the bounding box; and $\vec{c}$ is a vector representing the one-hot-encoded class label.
In greater detail, $p = \sigma(t_o)$ has the interpretation of being the product of two terms: the probability that the cell contains an object, and the Intersection-over-Union (IoU) of the predicted bounding box and the ground truth. The sigmoid function is used to ensure that $p$ is between 0 and 1.
The $x$ and $y$ coordinates of the object midpoint, $b_x$ and $b_y$, are expressed relative to the top-left of the cell, and as a fraction of the dimensions of the cell. Since these should be between 0 and 1, the sigmoid function is used. Thus, $b_x = \sigma(t_x)$ and similarly for $y$.
The width and height of the bounding box, $b_w$ and $b_h$, are expressed relative to the dimensions of the anchor box, $p_w$ and $p_h$: $b_w = p_w e^{t_w}$, and similarly for the height. Thus, $t_w$ and $t_h$ have the effect of scaling the anchor box dimensions.
Finally, $\vec{c}$ has the interpretation of being the probability of the object being of a particular class, conditional on there being an object. The overall class probability is the product of $p$ and $\vec{c}$.
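Putting the three parameterizations together, here is a small NumPy sketch (mine, not the authors’ code) that decodes the raw outputs for a single anchor box, $(t_o, t_x, t_y, t_w, t_h)$ plus class logits, into $p$, the box, and the class probabilities; the anchor dimensions and raw values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def decode(raw, anchor_wh):
    """Decode one anchor box's raw network outputs.

    raw = [t_o, t_x, t_y, t_w, t_h, class logits ...]
    anchor_wh = (p_w, p_h), the anchor box dimensions.
    """
    t_o, t_x, t_y, t_w, t_h = raw[:5]
    p = sigmoid(t_o)                       # objectness * IoU, squashed to (0, 1)
    b_x, b_y = sigmoid(t_x), sigmoid(t_y)  # midpoint, relative to the cell
    b_w = anchor_wh[0] * np.exp(t_w)       # width: anchor width scaled by e^{t_w}
    b_h = anchor_wh[1] * np.exp(t_h)       # height: anchor height scaled by e^{t_h}
    class_probs = softmax(raw[5:])         # conditional class probabilities
    return p, (b_x, b_y, b_w, b_h), class_probs

raw = np.array([2.0, -1.0, 0.5, 0.1, -0.2, 3.0, -1.0])  # two classes
p, box, c = decode(raw, anchor_wh=(0.10, 0.25))
print(p, box, p * c)  # p * c gives the overall class probabilities
```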
Some examples may be helpful. Consider a cell that does not contain any objects of interest to the object detection system. Each anchor box of that cell would be represented by the vector $\left[\begin{array}{cccccc} 0 & ? & ? & ? & ? & \vec{?} \end{array}\right]^T$, where the $?$ symbol is a special “doesn’t matter” indicator. Such entries do not contribute to the cost function since when there is no object, location and class membership are meaningless.
Now consider a cell having a stop sign. Let’s say we only care about detecting stop signs and yield signs for simplicity, so $\left[\begin{array}{cc} 1 & 0 \end{array}\right]$ corresponds to a stop sign and $\left[\begin{array}{cc} 0 & 1 \end{array}\right]$ to a yield sign. Whichever anchor box has the most overlap with the stop sign would have label
$$ \left[\begin{array}{ccccccc} 1 & 0.25 & 0.1 & 0.15 & 0.17 & 1 & 0 \end{array}\right]^T $$
with entries indicating, in order, that the cell contains an object; that its midpoint is located 25% of the way from the left of the cell and 10% of the way down from the top of the cell; that its width is 15% of the width of the cell and its height 17% of the height of the cell; and that it is a stop sign. The other anchor boxes associated with the same cell would correspond to a “no object” label. But if there were multiple objects in the same cell, they might be captured by other anchor boxes.
The labels for the anchor boxes associated with each cell are concatenated to create a label in $\mathbb{R}^{5 \cdot (5 + |C|)}$, where $|C|$ is the number of classes. In our example, there are two classes, so the cell label is in $\mathbb{R}^{35}$. The label for the entire image is represented by a rank-3 tensor with dimensions $S \times S \times 35$. The size of the last dimension depends on the number of classes and anchor boxes, but I like thinking in terms of particular numbers, so I’ll keep using 35 as the example.
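As a sketch of what this looks like in code (my own illustration; the ordering of the last dimension and the choice of responsible cell and anchor are hypothetical), here is the stop-sign example written into a label tensor:

```python
import numpy as np

S, B, C = 13, 5, 2               # grid size, anchor boxes, classes
y = np.zeros((S, S, B, 5 + C))   # later reshaped to (S, S, B * (5 + C)) = (13, 13, 35)
# p defaults to 0 everywhere, i.e. "no object"; the remaining entries are the
# "doesn't matter" slots and are simply ignored by the cost function.

row, col, anchor = 7, 4, 2       # hypothetical responsible cell and anchor box
y[row, col, anchor] = [1.0,      # p: this anchor box contains an object
                       0.25,     # b_x: 25% of the way from the cell's left edge
                       0.10,     # b_y: 10% of the way down from the cell's top
                       0.15,     # b_w
                       0.17,     # b_h
                       1.0, 0.0] # one-hot class: stop sign

y = y.reshape(S, S, B * (5 + C))
print(y.shape)  # (13, 13, 35)
```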
Training the model
The authors use a deep convolutional network pretrained on ImageNet (in the original version, 24 convolutional layers followed by 2 fully connected layers; v2 switches to a fully convolutional backbone). They used dropout in the original algorithm but replaced it with batch normalization in v2. In v2, the authors also leveraged a hierarchical labeling approach to augment the labels. They did this because detection data sets tend to have imprecise labels like “dog”, whereas classification data sets will often have very specific labels like “Boston terrier”, and they needed a way to map between the two. As a bonus, their algorithm can detect over 9000 types of objects!
Applying the model
When applying the model, we feed the image through the network, resulting in a predicted tensor in $\mathbb{R}^{13 \times 13 \times 35}$ (recall we are using a 13-by-13 image grid, 5 anchor boxes per cell, and 2 object classes). This output is actually several predictions in one: one candidate object for each combination of cell and anchor box.
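To turn that tensor into a flat list of candidate detections, we can loop over every cell and anchor box, compute a confidence score $p \cdot \max \vec{c}$, and convert the cell-relative midpoints to image-relative coordinates. Below is a minimal sketch of this step (my own; random numbers stand in for actual decoded network outputs, and the last-dimension layout matches the label example above):

```python
import numpy as np

S, B, C = 13, 5, 2
# Pretend this is the network output reshaped from (13, 13, 35) to
# (13, 13, 5, 7) and already decoded into (p, b_x, b_y, b_w, b_h, c1, c2);
# random numbers stand in for a real forward pass.
pred = np.random.rand(S, S, B, 5 + C)

boxes, scores, classes = [], [], []
for row in range(S):
    for col in range(S):
        for a in range(B):
            p = pred[row, col, a, 0]
            b_x, b_y, b_w, b_h = pred[row, col, a, 1:5]
            c = pred[row, col, a, 5:]
            # Confidence used later for filtering and non-max suppression.
            scores.append(p * c.max())
            classes.append(int(c.argmax()))
            # Convert the cell-relative midpoint to image-relative coordinates.
            boxes.append(((col + b_x) / S, (row + b_y) / S, b_w, b_h))

print(len(scores))  # 13 * 13 * 5 = 845 candidate detections
```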
Typically we’ll have multiple predictions corresponding to the same object, so YOLO uses a method called “non-max suppression” to generate the final list of predictions. Basically, we identify overlapping bounding boxes and suppress all but the best one.
The procedure is iterative. First, eliminate any low-confidence predictions, say those where $p \cdot \max \vec{c} \leq 0.6$. Among the remaining predictions, take the one with the highest confidence and output it as one of the final predictions. Then loop over the other predictions and suppress any whose bounding box overlaps the chosen box too much; the suppression threshold is based on the IoU between the boxes, and a box with IoU greater than 0.5 is suppressed. Finally, among the predictions that are still left, pick the one with the highest confidence and repeat, until every prediction has either been finalized or suppressed.
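Here is a minimal sketch of that procedure (my own implementation, not the reference one), operating on midpoint-style (x, y, w, h) boxes and their confidence scores; the 0.6 and 0.5 thresholds follow the text:

```python
def iou(box_a, box_b):
    """IoU of two (x_mid, y_mid, w, h) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """Greedy NMS: drop low-confidence boxes, then repeatedly keep the
    highest-scoring remaining box and suppress boxes that overlap it."""
    # Step 1: discard low-confidence predictions.
    candidates = [i for i, s in enumerate(scores) if s > score_thresh]
    # Step 2: sort remaining candidates by confidence, highest first.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        # Suppress any remaining box that overlaps the kept box too much.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_thresh]
    return kept

boxes = [(0.42, 0.61, 0.10, 0.12), (0.44, 0.60, 0.11, 0.12), (0.80, 0.30, 0.20, 0.25)]
scores = [0.92, 0.85, 0.70]
print(non_max_suppression(boxes, scores))  # [0, 2]: the second box is suppressed
```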
Other Approaches to Object Detection
There seem to be two families of object detection algorithms: regression methods and methods based on region proposals. YOLO is a regression method. Region proposal methods first identify regions likely to contain objects and then classify them. Because they involve two steps, they tend to be slower than the one-step regression methods.
Girshick et al. (2014) introduced the R-CNN (regions with convolutional neural networks) method, which was quite slow. Girshick (2015) introduced the Fast R-CNN method, and Ren et al. (2016) introduced the Faster R-CNN method.