Convolutional Neural Networks
Characteristics of an Image
- High dimensionality
- Spatial correlation: nearby patches are highly related
- Should be invariant to transformations: a horizontally flipped cat is still a cat
Regular neural networks do not take advantage of or account for these characteristics.
Invariance and Equivariance
A model $f$ is invariant to a transformation $T$ if:
$$ f(T(x)) = f(x) $$
A flipped cat is still a cat.
A model $f$ is equivariant to a transformation $T$ if:
$$ f(T(x)) = T(f(x)) $$
Think about real-time motion detection: a person captured in one frame should still be detected in the next frame, with the detection shifted by the same transformation the person underwent.
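To make this concrete, here is a minimal sketch (assuming PyTorch; the image and kernel are random, illustrative tensors) showing that convolution is translation-equivariant away from the image border:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)       # one 8x8 single-channel "image"
kernel = torch.randn(1, 1, 3, 3)  # one 3x3 kernel

# T: translate 2 pixels to the right (wrapping at the border)
shift = lambda t: torch.roll(t, shifts=2, dims=-1)

out_then_shift = shift(F.conv2d(x, kernel, padding=1))  # T(f(x))
shift_then_out = F.conv2d(shift(x), kernel, padding=1)  # f(T(x))

# Equal everywhere except near the border, where padding and wrapping differ
print(torch.allclose(out_then_shift[..., 3:-3], shift_then_out[..., 3:-3]))  # True
```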
Convolution
What does a hidden layer do in a neural network? Each hidden unit in a layer computes a linear combination of the input features.
In a fully connected layer, the hidden layer simply computes every such possible combination.
But in an image, not all pixels are related to each other. For example, the top-left pixel is usually unrelated to the bottom-right pixel, so there is little point in learning all of these combinations.
In addition, the input feature space is too large for a fully connected layer, causing the number of parameters to explode.
Usually a meaningful object or feature is constrained to a small region or patch of the image. We want to learn the combination of pixels in that patch.
So we apply convolution to the image with a kernel.
Kernel / Filter
A kernel is a small grid of weights of a fixed size that slides over the image; each hidden unit learns a combination of the pixels in the patch the kernel currently covers.
These kernels are parameter-efficient because their weights are shared across the image: within a single channel, the same weights are reused at every spatial position.
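As a quick illustration of why this sharing matters, here is a minimal sketch (assuming PyTorch): a single 3×3 kernel has the same ten parameters no matter how large the image is.

```python
import torch
import torch.nn as nn

# One 3x3 kernel over a single-channel image: 3*3 weights + 1 bias = 10 parameters
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # 10

# The same 10 parameters handle images of any size
print(conv(torch.randn(1, 1, 28, 28)).shape)    # torch.Size([1, 1, 26, 26])
print(conv(torch.randn(1, 1, 224, 224)).shape)  # torch.Size([1, 1, 222, 222])
```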
Kernel Size, Stride, Padding, Dilation
There are four hyperparameters that control the output size of a convolution (the number of hidden units in the next layer):
- Kernel size (K): the size of the patch (usually square)
- Stride (S): the number of pixels the kernel moves each time
- Padding (P): the number of pixels added to the border of the image
- Dilation (D): the spacing between kernel elements; $D = 1$ means no pixels are skipped
Let the width and height of the input image be $W$ and $H$ respectively. The output size is calculated as:
$$ \begin{gather*} W_{\text{out}} = \left\lfloor \frac{W + 2P - D(K - 1) - 1}{S} \right\rfloor + 1 \\ H_{\text{out}} = \left\lfloor \frac{H + 2P - D(K - 1) - 1}{S} \right\rfloor + 1 \end{gather*} $$
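As a sanity check, here is a small sketch (assuming PyTorch; the hyperparameter values are arbitrary) that compares the formula with the output shape PyTorch actually produces:

```python
import torch
import torch.nn as nn

def conv_out_size(n, k, s=1, p=0, d=1):
    # Output size along one spatial dimension, per the formula above
    return (n + 2 * p - d * (k - 1) - 1) // s + 1

W = H = 32
K, S, P, D = 3, 2, 1, 1  # arbitrary example hyperparameters

conv = nn.Conv2d(1, 1, kernel_size=K, stride=S, padding=P, dilation=D)
out = conv(torch.randn(1, 1, H, W))

print(out.shape[-2:])                  # torch.Size([16, 16])
print(conv_out_size(H, K, S, P, D),
      conv_out_size(W, K, S, P, D))    # 16 16
```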
ReLU Activation
After each convolution, we apply the ReLU activation function.
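Concretely (a tiny sketch, assuming PyTorch):

```python
import torch
import torch.nn.functional as F

# ReLU keeps positive values and zeroes out negatives: max(0, x), elementwise
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(F.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 1.5000])
```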
Feature Maps
Each kernel calculates a certain combination of pixels in the patch. We often think of each kernel as a feature detector with a certain purpose like detecting edges or corners.
So each kernel's output indicates whether the feature it is looking for is present in the patch.
We have many different features that we look for in an image, such as edges, corners, and textures.
So we have many kernels that slide over the image to produce many feature maps.
Channel
The number of feature maps is called the number of channels of the layer. It equals the number of kernels used in the convolution.
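For example (a sketch assuming PyTorch; the channel counts are illustrative), 16 kernels applied to a 3-channel RGB image produce 16 feature maps:

```python
import torch
import torch.nn as nn

# 16 kernels over a 3-channel input -> 16 output channels (feature maps)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)  # a batch of one 32x32 RGB image
print(conv(x).shape)           # torch.Size([1, 16, 32, 32])
```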
Number of Parameters
The number of parameters in a convolutional layer is:
$$ ((W \cdot H \cdot C_{\text{in}}) + 1) \cdot C_{\text{out}} $$
- $W$: the width of the kernel
- $H$: the height of the kernel
- $C_{\text{in}}$: the number of channels in the input
- $C_{\text{out}}$: the number of channels in the output
The $+1$ is for the bias term, which is of size $C_{\text{out}}$.
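A quick check of this formula against PyTorch (the kernel and channel sizes are arbitrary):

```python
import torch.nn as nn

W, H, C_in, C_out = 3, 3, 3, 16  # arbitrary example sizes

conv = nn.Conv2d(C_in, C_out, kernel_size=(H, W))
formula = ((W * H * C_in) + 1) * C_out
actual = sum(p.numel() for p in conv.parameters())
print(formula, actual)  # 448 448
```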
Pooling / Downsampling
Pooling is a technique to downsample the feature maps. Pooling layers also have a size and stride.
A size of 2 and a stride of 2 halves both the width and the height of the feature map.
Max Pooling
Although less common in modern architectures, max pooling was a popular pooling technique.
It takes the maximum value in the patch.
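A small sketch (assuming PyTorch) of max pooling with size 2 and stride 2:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
x = torch.arange(16.0).reshape(1, 1, 4, 4)  # a 4x4 "feature map"

# Each 2x2 patch is reduced to its maximum; 4x4 -> 2x2
print(pool(x))
# tensor([[[[ 5.,  7.],
#           [13., 15.]]]])
```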
Fully Connected Layer
Convolutional layers are usually followed by fully connected layers with dropout and batch normalization layers in between.
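Here is a minimal sketch of such a layout (assuming PyTorch; `SmallCNN` and all sizes are illustrative, not a specific published architecture):

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolution -> batch norm -> ReLU -> max pool
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),  # 32x32 -> 16x16
        )
        # Flatten, then fully connected layers with dropout in between
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 16 * 16, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(SmallCNN()(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```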
Batch Normalization
Given a minibatch of activations $h_1, h_2, \ldots, h_B$, we normalize each activation to have zero mean and unit variance:
$$ \begin{gather*} \mu = \frac{1}{B} \sum_{i=1}^B h_i \\[1em] \sigma^2 = \frac{1}{B} \sum_{i=1}^B (h_i - \mu)^2 \\[1em] h_i \leftarrow \frac{h_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \end{gather*} $$
Then we add two learnable parameters, $\gamma$ and $\beta$, so that the network can rescale and shift the normalized activations:
$$ h_i \leftarrow \gamma h_i + \beta $$
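The same computation written out directly (a sketch assuming PyTorch; $\gamma$ and $\beta$ are initialized to 1 and 0, as `nn.BatchNorm1d` does):

```python
import torch

B, eps = 4, 1e-5
h = torch.randn(B)  # a minibatch of B scalar activations

mu = h.mean()
var = ((h - mu) ** 2).mean()                # biased variance, as in the formula
h_hat = (h - mu) / torch.sqrt(var + eps)    # zero mean, unit variance

gamma, beta = torch.ones(()), torch.zeros(())  # learnable in practice
out = gamma * h_hat + beta
print(out.mean().item(), out.var(unbiased=False).item())  # ~0.0, ~1.0
```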