Convolutional Neural Networks
Characteristics of an Image
- High dimensionality
- Spatial correlation: nearby patches are highly related
- Should be invariant to transformations: a horizontally flipped cat is still a cat
Regular neural networks do not take advantage of or account for these characteristics.
Invariance and Equivariance
A model $f$ is invariant to a transformation $T$ if:
$$ f(T(x)) = f(x) $$
A flipped cat is still a cat.
A model $f$ is equivariant to a transformation $T$ if:
$$ f(T(x)) = T(f(x)) $$
Think about real-time motion detection: a person captured in one frame should still be detected in the next frame, with the detection shifted by the same transformation the person underwent.
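To make this concrete, here is a minimal sketch (assuming PyTorch; the image and kernel are random, illustrative tensors) showing that convolution is translation-equivariant away from the image border:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)       # one 8x8 single-channel "image"
kernel = torch.randn(1, 1, 3, 3)  # one 3x3 kernel

# T: translate 2 pixels to the right (wrapping at the border)
shift = lambda t: torch.roll(t, shifts=2, dims=-1)

out_then_shift = shift(F.conv2d(x, kernel, padding=1))  # T(f(x))
shift_then_out = F.conv2d(shift(x), kernel, padding=1)  # f(T(x))

# Equal everywhere except near the border, where padding and wrapping differ
print(torch.allclose(out_then_shift[..., 3:-3], shift_then_out[..., 3:-3]))  # True
```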
Convolution
What does a hidden layer do in a neural network? Each hidden unit in a layer computes a linear combination of the input features.
In a fully connected layer, the hidden layer simply computes every such possible combination.
But in an image, not all pixels are related to each other. For example, the top-left pixel is usually unrelated to the bottom-right pixel, so there is little point in learning all of these combinations.
In addition, the input feature space is too large for a fully connected layer, causing the number of parameters to explode.
Usually a meaningful object or feature is constrained to a small region or patch of the image. We want to learn the combination of pixels in that patch.
So we apply convolution to the image with a kernel.
Kernel / Filter
A kernel is a small grid of weights of a fixed size that slides over the image; each hidden unit learns a combination of the pixels in the patch the kernel currently covers.
These kernels are parameter-efficient because their weights are shared across the image: within a single channel, the same weights are reused at every spatial position.
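As a quick illustration of why this sharing matters, here is a minimal sketch (assuming PyTorch): a single 3×3 kernel has the same ten parameters no matter how large the image is.

```python
import torch
import torch.nn as nn

# One 3x3 kernel over a single-channel image: 3*3 weights + 1 bias = 10 parameters
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # 10

# The same 10 parameters handle images of any size
print(conv(torch.randn(1, 1, 28, 28)).shape)    # torch.Size([1, 1, 26, 26])
print(conv(torch.randn(1, 1, 224, 224)).shape)  # torch.Size([1, 1, 222, 222])
```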
Kernel Size, Stride, Padding, Dilation
There are four hyperparameters that control the output size of a convolution (the number of hidden units in the next layer):
- Kernel size (K): the size of the patch (usually square)
- Stride (S): the number of pixels the kernel moves each time
- Padding (P): the number of pixels added to the border of the image
- Dilation (D): the spacing between kernel elements; $D = 1$ means no pixels are skipped
Let the width and height of the input image be $W$ and $H$ respectively. The output size is calculated as:
$$ \begin{gather*} W_{\text{out}} = \left\lfloor \frac{W + 2P - D(K - 1) - 1}{S} \right\rfloor + 1 \\ H_{\text{out}} = \left\lfloor \frac{H + 2P - D(K - 1) - 1}{S} \right\rfloor + 1 \end{gather*} $$
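As a sanity check, here is a small sketch (assuming PyTorch; the hyperparameter values are arbitrary) that compares the formula with the output shape PyTorch actually produces:

```python
import torch
import torch.nn as nn

def conv_out_size(n, k, s=1, p=0, d=1):
    # Output size along one spatial dimension, per the formula above
    return (n + 2 * p - d * (k - 1) - 1) // s + 1

W = H = 32
K, S, P, D = 3, 2, 1, 1  # arbitrary example hyperparameters

conv = nn.Conv2d(1, 1, kernel_size=K, stride=S, padding=P, dilation=D)
out = conv(torch.randn(1, 1, H, W))

print(out.shape[-2:])                  # torch.Size([16, 16])
print(conv_out_size(H, K, S, P, D),
      conv_out_size(W, K, S, P, D))    # 16 16
```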
ReLU Activation
After each convolution, we apply the ReLU activation function.
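Concretely (a tiny sketch, assuming PyTorch):

```python
import torch
import torch.nn.functional as F

# ReLU keeps positive values and zeroes out negatives: max(0, x), elementwise
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(F.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000])
```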
Feature Maps
Each kernel calculates a certain combination of pixels in the patch. We often think of each kernel as a feature detector with a certain purpose like detecting edges or corners.
So each kernel's output indicates whether the feature it is looking for is present in the patch.
We have many different features that we look for in an image, such as edges, corners, and textures.
So we have many kernels that slide over the image to produce many feature maps.
Channel
The number of feature maps is called the number of channels of the layer. It equals the number of kernels used in the convolution.
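For example (a sketch assuming PyTorch; the channel counts are illustrative), 16 kernels applied to a 3-channel RGB image produce 16 feature maps:

```python
import torch
import torch.nn as nn

# 16 kernels over a 3-channel input -> 16 output channels (feature maps)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)  # a batch of one 32x32 RGB image
print(conv(x).shape)           # torch.Size([1, 16, 32, 32])
```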
Number of Parameters
The number of parameters in a convolutional layer is:
$$ ((W \cdot H \cdot C_{\text{in}}) + 1) \cdot C_{\text{out}} $$
- $W$: the width of the kernel
- $H$: the height of the kernel
- $C_{\text{in}}$: the number of channels in the input
- $C_{\text{out}}$: the number of channels in the output
The $+1$ is for the bias term, which is of size $C_{\text{out}}$.
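A quick check of this formula against PyTorch (the kernel and channel sizes are arbitrary):

```python
import torch.nn as nn

W, H, C_in, C_out = 3, 3, 3, 16  # arbitrary example sizes

conv = nn.Conv2d(C_in, C_out, kernel_size=(H, W))
formula = ((W * H * C_in) + 1) * C_out
actual = sum(p.numel() for p in conv.parameters())
print(formula, actual)  # 448 448
```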
Pooling / Downsampling
Pooling is a technique to downsample the feature maps. Pooling layers also have a size and stride.
A size of 2 and a stride of 2 halves both the width and the height of the feature map.
Max Pooling
Although less common in modern architectures, max pooling was a popular pooling technique.
It takes the maximum value in the patch.
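A small sketch (assuming PyTorch) of max pooling with size 2 and stride 2:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.arange(16.0).reshape(1, 1, 4, 4)  # a 4x4 "feature map"

# Each 2x2 patch is reduced to its maximum; 4x4 -> 2x2
print(pool(x))
# tensor([[[[ 5.,  7.],
#           [13., 15.]]]])
```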
Fully Connected Layer
Convolutional layers are usually followed by fully connected layers with dropout and batch normalization layers in between.
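Here is a minimal sketch of such a layout (assuming PyTorch; `SmallCNN` and all sizes are illustrative, not a specific published architecture):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolution -> batch norm -> ReLU -> max pool
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 32x32 -> 16x16
        )
        # Flatten, then fully connected layers with dropout in between
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 16 * 16, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(SmallCNN()(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```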
Batch Normalization
Given a minibatch of activations $h_1, h_2, \ldots, h_B$, we normalize each activation to have zero mean and unit variance:
$$ \begin{gather*} \mu = \frac{1}{B} \sum_{i=1}^B h_i \\[1em] \sigma^2 = \frac{1}{B} \sum_{i=1}^B (h_i - \mu)^2 \\[1em] h_i \leftarrow \frac{h_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \end{gather*} $$
Then we add two learnable parameters, $\gamma$ and $\beta$, so that the network can rescale and shift the normalized activations:
$$ h_i \leftarrow \gamma h_i + \beta $$
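The same computation written out directly (a sketch assuming PyTorch; $\gamma$ and $\beta$ are initialized to 1 and 0, as `nn.BatchNorm1d` does):

```python
import torch

B, eps = 4, 1e-5
h = torch.randn(B)  # a minibatch of B scalar activations

mu = h.mean()
var = ((h - mu) ** 2).mean()                # biased variance, as in the formula
h_hat = (h - mu) / torch.sqrt(var + eps)    # zero mean, unit variance

gamma, beta = torch.ones(()), torch.zeros(())  # learnable in practice
out = gamma * h_hat + beta
print(out.mean().item(), out.var(unbiased=False).item())  # ~0.0, ~1.0
```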