Computer Vision Cheat Sheet
I am currently working through Convolutional Neural Networks, the fourth course in the Coursera specialization on Deep Learning. The first week of that course contains some hard-to-remember equations about filter sizes, padding, and striding, and I thought it would be helpful to write them out for future reference.
Filters are typically square, having size $f$, and a filter's “depth” matches the number of input channels, $n_C$. The input image has dimensions $n_H \times n_W \times n_C$. Sometimes we pad the image with $p$ zeros on all sides, effectively increasing the image size to $(n_H + 2p) \times (n_W + 2p) \times n_C$ (we don’t pad along the channel dimension). We often “stride” by more than one, meaning that as we slide the filter around the image, we move more than one row or column at a time. The stride is denoted by $s$.
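To make the padding step concrete, here is a minimal sketch in plain Python for a single-channel image stored as a list of rows. The function name `zero_pad` is my own, not something from the course.

```python
def zero_pad(image, p):
    """Return a copy of a 2D image (list of rows) with p zeros on every side."""
    padded_width = len(image[0]) + 2 * p
    # p all-zero rows on top
    rows = [[0] * padded_width for _ in range(p)]
    # each original row gets p zeros on the left and on the right
    for row in image:
        rows.append([0] * p + list(row) + [0] * p)
    # p all-zero rows on the bottom
    rows.extend([0] * padded_width for _ in range(p))
    return rows


padded = zero_pad([[1, 2], [3, 4]], p=1)
print(len(padded), len(padded[0]))  # a 2x2 image becomes 4x4
```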
Convolutions without padding ($p=0$) are called “valid”. Convolutions with padding $p = \frac{f-1}{2}$ are called “same” because, with stride $s=1$, the resulting output size is the same as the input size. This requires $f$ to be odd so that $p$ is an integer.
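A quick check of the “same” claim, assuming stride 1 and odd $f$: a filter of size $f$ fits into a padded length of $n + 2p$ in $n + 2p - f + 1$ positions, so

$$
n + 2p - f + 1 = n + 2 \cdot \frac{f-1}{2} - f + 1 = n + (f - 1) - f + 1 = n.
$$

For example, a $3 \times 3$ filter needs $p = 1$ and a $5 \times 5$ filter needs $p = 2$.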
When we apply a filter with size $f$, padding $p$, and stride $s$, to an image of size $n_H \times n_W$, the resulting output height is $\left\lfloor \frac{n_H+2p-f}{s} + 1 \right\rfloor$, and the width is the same, replacing $n_H$ with $n_W$.
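The formula is easy to turn into a small helper. This is just a sketch; the name `conv_output_size` and the error check are my own additions. Because $n + 2p - f$ is a non-negative integer here, Python's floor division gives the same result as the floor in the formula.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output length along one dimension for input length n,
    filter size f, padding p, and stride s."""
    if n + 2 * p < f:
        raise ValueError("filter is larger than the padded input")
    # floor((n + 2p - f) / s + 1) == (n + 2p - f) // s + 1 for integer inputs
    return (n + 2 * p - f) // s + 1


print(conv_output_size(6, 3))       # valid convolution: 4
print(conv_output_size(6, 3, p=1))  # same convolution: 6
print(conv_output_size(7, 3, s=2))  # stride 2: 3
```

Call it once with $n_H$ and once with $n_W$ to get the full output shape.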
When applying a convolutional layer to an input with dimensions $n_H \times n_W \times n_C$, with $n_f$ filters having size $f$, the number of parameters is $n_f \cdot (f^2 \cdot n_C + 1)$. (The +1 corresponds to the bias term: each filter has a single bias, shared across all positions.) Note that the count does not depend on $n_H$ or $n_W$.
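As a sanity check on the parameter count, here is a tiny sketch (the function name `conv_layer_params` is my own):

```python
def conv_layer_params(f, n_c, n_f):
    """Number of learnable parameters in a conv layer with n_f filters
    of size f x f x n_c, each with one bias."""
    weights_per_filter = f * f * n_c
    return n_f * (weights_per_filter + 1)


# 10 filters of size 3x3 over a 3-channel input: 10 * (27 + 1)
print(conv_layer_params(f=3, n_c=3, n_f=10))  # 280
```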