Shallow Neural Networks
Shallow neural networks represent piecewise linear functions.
By shallow, we mean that they have only one hidden layer.
However, given enough hidden units, a shallow network can approximate arbitrary relationships to any desired degree of accuracy.
Shallow networks also support multi-dimensional input and output.
Limitations of Regression
- Linear regression can only describe linear relationships.
- It does not support multi-dimensional output.
From Input Layer to Hidden Layer
One bias unit is added to the input layer, and each unit in the input layer is weighted by a parameter and fed into a hidden layer.
We call the following the pre-activation:
$$ \biasvec_0 + \weightmat_0 \inputx $$
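A rough NumPy sketch of this computation (the dimensions and parameter values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

D_i, D = 2, 3                       # input dimension and number of hidden units (hypothetical)
x = rng.standard_normal((D_i, 1))   # input column vector

beta_0 = rng.standard_normal((D, 1))      # bias vector, D x 1
Omega_0 = rng.standard_normal((D, D_i))   # weight matrix, D x D_i

z = beta_0 + Omega_0 @ x            # pre-activation, shape (D, 1)
print(z.shape)                      # (3, 1)
```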
Multi-Dimensional Input
When the input is univariate, the network is a piecewise linear function.
If there are two input features, the network represents a continuous surface made of planar pieces.
Number of Parameters from Input Layer to Hidden Layer
The book denotes the dimension of the input as:
$$ D_i $$
Here $i$ stands for input.
- $\biasvec_0$ is the bias vector of dimension $D \times 1$, where $D$ is the number of hidden units.
- $\weightmat_0$ is the weight matrix of dimension $D \times D_i$.
Thus, the number of parameters from the input layer to the hidden layer is:
$$ (D_i + 1) \cdot D $$
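For a concrete (made-up) example: with $D_i = 2$ inputs and $D = 3$ hidden units, $\biasvec_0$ contributes $3$ parameters and $\weightmat_0$ contributes $3 \times 2 = 6$, for $(2 + 1) \cdot 3 = 9$ in total.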
Activation Function
The book denotes the activation function as:
$$ \mathbf{a}[\bullet] $$
Rectified Linear Unit (ReLU)
One of the most common activation functions is the ReLU.
For univariate pre-activation $z$:
$$ a[z] = \text{ReLU}(z) = \max(0, z) $$
It zeroes out negative values; for multi-dimensional input, it is applied element-wise.
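A minimal sketch of ReLU in NumPy, applied element-wise to a vector of pre-activations:

```python
import numpy as np

def relu(z):
    """Element-wise ReLU: negative entries are clipped to zero."""
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```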
Other Activation Functions
Other common choices include the sigmoid, tanh, and leaky ReLU; the rest of these notes focus on ReLU.
Purpose of Activation Functions
ReLU effectively acts as a gate: a nonnegative pre-activation passes through unchanged (its contribution to the network is activated), while a negative one is blocked and set to zero (deactivated).
By turning on and off some parts of the computation, activation functions introduce non-linearity to the model.
The figures above show the pre-activations and activations of each hidden unit.
The activations will then be weighted and combined to form the output units.
Without the activation function, the network would collapse to a linear model, since a composition of affine maps is itself affine.
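A quick numerical check of this collapse (a sketch with arbitrary random parameters): two stacked affine layers with no activation in between reduce to a single affine layer.

```python
import numpy as np

rng = np.random.default_rng(1)
D_i, D, D_o = 2, 3, 1
x = rng.standard_normal((D_i, 1))

beta_0, Omega_0 = rng.standard_normal((D, 1)), rng.standard_normal((D, D_i))
beta_1, Omega_1 = rng.standard_normal((D_o, 1)), rng.standard_normal((D_o, D))

# Two affine layers with no activation in between...
y_two_layers = beta_1 + Omega_1 @ (beta_0 + Omega_0 @ x)

# ...are equivalent to a single affine layer with combined parameters.
beta = beta_1 + Omega_1 @ beta_0
Omega = Omega_1 @ Omega_0
y_one_layer = beta + Omega @ x

print(np.allclose(y_two_layers, y_one_layer))  # True
```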
Hidden Layer
A hidden layer is a layer between the input and output layers.
In each hidden layer, there are hidden units ($h_1, h_2, h_3$ in the figure).
Do not confuse hidden unit notation $h_d$ with hidden layer notation $\hidden_k$.
Each hidden unit has an activation function to be applied to the pre-activations.
The book uses $K$ to denote the number of hidden layers, and $D_k$ to denote the number of hidden units in the $k$-th layer.
$$ \hidden_k = \activate \left[ \biasvec_{k-1} + \weightmat_{k-1} \hidden_{k-1} \right], \qquad \hidden_0 = \inputx $$
Shallow
By shallow, we mean that the network has only one hidden layer.
In a shallow network, $K = 1$,
\[\hidden_1 = \activate \left[ \biasvec_0 + \weightmat_0 \inputx \right]\]
Hidden Units
The book denotes the number of hidden units in the $k$-th layer as:
$$ D_k $$
The number of hidden units is sometimes called the capacity of the network.
To Output Layer
Similar to how the input units were weighted, combined, and activated, the results of the hidden units are again weighted and combined to form the output units.
The only difference is that the output units do not go through an activation function.
The output layer is:
$$ \outputy = \biasvec_K + \weightmat_K \hidden_K $$
In a shallow network,
\[\outputy = \biasvec_1 + \weightmat_1 \hidden_1\]
Multi-Dimensional Output
When the output is univariate, the network represents a single piecewise linear function (pieces that are line segments for one input, planes for two inputs, and so on).
If there is more than one output, the network represents a collection of such piecewise linear functions, one per output.
Each of these functions has the same joints and regions as in the single-output case, but their slopes and offsets differ because each output unit has its own weights and bias.
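Putting the hidden-layer and output-layer equations together, here is a sketch of the full shallow forward pass with multi-dimensional input and output (all sizes and parameters are arbitrary placeholders):

```python
import numpy as np

def shallow_forward(x, beta_0, Omega_0, beta_1, Omega_1):
    """Shallow network: ReLU hidden layer followed by a linear output layer."""
    h_1 = np.maximum(0.0, beta_0 + Omega_0 @ x)   # hidden activations, (D_1, 1)
    return beta_1 + Omega_1 @ h_1                 # outputs, (D_o, 1)

rng = np.random.default_rng(2)
D_i, D_1, D_o = 2, 3, 2   # hypothetical sizes
x = rng.standard_normal((D_i, 1))
beta_0, Omega_0 = rng.standard_normal((D_1, 1)), rng.standard_normal((D_1, D_i))
beta_1, Omega_1 = rng.standard_normal((D_o, 1)), rng.standard_normal((D_o, D_1))

print(shallow_forward(x, beta_0, Omega_0, beta_1, Omega_1).shape)  # (2, 1)
```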
Number of Parameters from Hidden Layer to Output Layer
The book denotes the dimension of the output as:
$$ D_o $$
Here $o$ stands for output.
- $\biasvec_K$ is the bias vector of dimension $D_o \times 1$.
- $\weightmat_K$ is the weight matrix of dimension $D_o \times D_K$.
Thus, the number of parameters in the output layer is:
$$ (D_K + 1) \cdot D_o $$
Total Number of Parameters in Shallow NN
- Input layer to hidden layer: $(D_i + 1) \cdot D_1$
- Hidden layer to output layer: $(D_1 + 1) \cdot D_o$
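The total is the sum of the two: $(D_i + 1) \cdot D_1 + (D_1 + 1) \cdot D_o$. A small sketch that computes this count (the sizes in the example call are hypothetical):

```python
def shallow_param_count(D_i, D_1, D_o):
    """Total parameters of a shallow network: input-to-hidden plus hidden-to-output."""
    return (D_i + 1) * D_1 + (D_1 + 1) * D_o

print(shallow_param_count(D_i=2, D_1=3, D_o=2))  # (2+1)*3 + (3+1)*2 = 17
```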
Overview of Shallow Neural Networks
Which hidden units are activated in a given region?
Look at the grey region in the figure. The activated hidden units for this region are $h_1$ and $h_3$.
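Since the figure itself is not reproduced here, a small sketch of the same idea: with a univariate input and three hidden units (hand-picked, arbitrary parameters), the units active at a point are those whose pre-activation is positive, and this pattern is constant within each linear region.

```python
import numpy as np

# Three hidden units with arbitrary, hand-picked parameters (for illustration only).
beta_0 = np.array([-1.0, 0.5, -0.2])
omega_0 = np.array([1.0, -1.0, 2.0])

def active_units(x):
    """Boolean mask of which hidden units are active at scalar input x."""
    return (beta_0 + omega_0 * x) > 0

print(active_units(0.0))   # [False  True False]
print(active_units(1.5))   # [ True False  True]
```

In this toy example, at $x = 1.5$ the active units are $h_1$ and $h_3$, the same kind of pattern described above for the grey region.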
Universal Approximation Theorem
Remember: more hidden units $\Rightarrow$ more ReLU joints $\Rightarrow$ more linear regions (flexibility).
Therefore, the universal approximation theorem states that with enough hidden units, a shallow neural network can approximate any continuous function to arbitrary precision.
So, in theory, increasing the width of a shallow network is enough to approximate any continuous function.
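As an informal illustration only (not a proof of the theorem): in the sketch below the hidden-layer parameters are drawn at random and only the output layer is fitted to $\sin(x)$ by least squares; the maximum fit error tends to shrink as the width $D$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-np.pi, np.pi, 200)
target = np.sin(x)

def fit_error(D):
    """Max error when fitting sin(x) with a width-D shallow ReLU net (random hidden layer)."""
    omega_0 = rng.standard_normal(D)
    beta_0 = rng.uniform(-np.pi, np.pi, D)
    H = np.maximum(0.0, np.outer(x, omega_0) + beta_0)   # hidden activations, (200, D)
    H = np.hstack([H, np.ones((x.size, 1))])             # extra column for the output bias
    coeffs, *_ = np.linalg.lstsq(H, target, rcond=None)  # fit output weights/bias only
    return np.max(np.abs(H @ coeffs - target))

for D in (3, 10, 100):
    print(D, fit_error(D))   # the error typically decreases as D grows
```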