Shallow Neural Networks
Shallow neural networks represent piecewise linear functions.
By shallow, we mean that they have only one hidden layer.
However, given enough hidden units, a shallow network can approximate arbitrary relationships to any desired degree of accuracy.
Shallow networks also support multi-dimensional input and output.
Limitations of Regression
- Linear regression can only describe linear relationships.
- It does not support multi-dimensional output.
From Input Layer to Hidden Layer
One bias unit is added to the input layer, and each unit in the input layer is weighted by a parameter and fed into a hidden layer.
We call the following the pre-activation:
$$ \biasvec_0 + \weightmat_0 \inputx $$
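A rough NumPy sketch of this computation (the dimensions and parameter values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

D_i, D = 2, 3                       # input dimension and number of hidden units (hypothetical)
x = rng.standard_normal((D_i, 1))   # input column vector

beta_0 = rng.standard_normal((D, 1))      # bias vector, D x 1
Omega_0 = rng.standard_normal((D, D_i))   # weight matrix, D x D_i

z = beta_0 + Omega_0 @ x            # pre-activation, shape (D, 1)
print(z.shape)                      # (3, 1)
```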
Multi-Dimensional Input
When the input is univariate, the network is a piecewise linear function.
If there are two input features, the network represents a continuous surface made of planar pieces.
Number of Parameters from Input Layer to Hidden Layer
The book denotes the dimension of the input as:
$$ D_i $$
Here $i$ stands for input.
- $\biasvec_0$ is the bias vector of dimension $D \times 1$, where $D$ is the number of hidden units.
- $\weightmat_0$ is the weight matrix of dimension $D \times D_i$.
Thus, the number of parameters from the input layer to the hidden layer is:
$$ (D_i + 1) \cdot D $$
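For a concrete (made-up) example: with $D_i = 2$ inputs and $D = 3$ hidden units, $\biasvec_0$ contributes $3$ parameters and $\weightmat_0$ contributes $3 \times 2 = 6$, for $(2 + 1) \cdot 3 = 9$ in total.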
Activation Function
The book denotes the activation function as:
$$ \mathbf{a}[\bullet] $$
Rectified Linear Unit (ReLU)
One of the most common activation functions is the ReLU.
For univariate pre-activation $z$:
$$ a[z] = \text{ReLU}(z) = \max(0, z) $$
It zeroes out negative values; for multi-dimensional input, it is applied element-wise.
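A minimal sketch of ReLU in NumPy, applied element-wise to a vector of pre-activations:

```python
import numpy as np

def relu(z):
    """Element-wise ReLU: negative entries are clipped to zero."""
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0.  0.  0.  1.5]
```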
Other Activation Functions
Other common choices include the sigmoid, tanh, and leaky ReLU; the rest of these notes focus on ReLU.
Purpose of Activation Functions
ReLU effectively acts as a gate: a nonnegative pre-activation passes through unchanged (its contribution to the network is activated), while a negative one is blocked and set to zero (deactivated).
By turning on and off some parts of the computation, activation functions introduce non-linearity to the model.
The figures above show the pre-activations and activations of each hidden unit.
The activations will then be weighted and combined to form the output units.
Without the activation function, the network would collapse to a linear model, since a composition of affine maps is itself affine.
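A quick numerical check of this collapse (a sketch with arbitrary random parameters): two stacked affine layers with no activation in between reduce to a single affine layer.

```python
import numpy as np

rng = np.random.default_rng(1)
D_i, D, D_o = 2, 3, 1
x = rng.standard_normal((D_i, 1))

beta_0, Omega_0 = rng.standard_normal((D, 1)), rng.standard_normal((D, D_i))
beta_1, Omega_1 = rng.standard_normal((D_o, 1)), rng.standard_normal((D_o, D))

# Two affine layers with no activation in between...
y_two_layers = beta_1 + Omega_1 @ (beta_0 + Omega_0 @ x)

# ...are equivalent to a single affine layer with combined parameters.
beta = beta_1 + Omega_1 @ beta_0
Omega = Omega_1 @ Omega_0
y_one_layer = beta + Omega @ x

print(np.allclose(y_two_layers, y_one_layer))  # True
```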
Hidden Layer
A hidden layer is a layer between the input and output layers.
In each hidden layer, there are hidden units ($h_1, h_2, h_3$ in the figure).
Do not confuse hidden unit notation $h_d$ with hidden layer notation $\hidden_k$.
Each hidden unit has an activation function to be applied to the pre-activations.
The book uses $K$ to denote the number of hidden layers, and $D_k$ to denote the number of hidden units in the $k$-th layer.
$$ \hidden_k = \activate \left[ \biasvec_{k-1} + \weightmat_{k-1} \hidden_{k-1} \right], \qquad \hidden_0 = \inputx $$
Shallow
By shallow, we mean that the network has only one hidden layer.
In a shallow network, $K = 1$,
\[\hidden_1 = \activate \left[ \biasvec_0 + \weightmat_0 \inputx \right]\]
Hidden Units
The book denotes the number of hidden units in the $k$-th layer as:
$$ D_k $$
The number of hidden units is sometimes called the capacity of the network.
To Output Layer
Similar to how the input units were weighted, combined, and activated, the results of the hidden units are again weighted and combined to form the output units.
The only difference is that the output units do not go through an activation function.
The output layer is:
$$ \outputy = \biasvec_K + \weightmat_K \hidden_K $$
In a shallow network,
\[\outputy = \biasvec_1 + \weightmat_1 \hidden_1\]
Multi-Dimensional Output
When the output is univariate, the network represents a single piecewise linear function (pieces that are line segments for one input, planes for two inputs, and so on).
If there is more than one output, the network represents a collection of such piecewise linear functions, one per output.
Each of these functions has the same joints and regions as in the single-output case, but their slopes and offsets differ because each output unit has its own weights and bias.
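Putting the hidden-layer and output-layer equations together, here is a sketch of the full shallow forward pass with multi-dimensional input and output (all sizes and parameters are arbitrary placeholders):

```python
import numpy as np

def shallow_forward(x, beta_0, Omega_0, beta_1, Omega_1):
    """Shallow network: ReLU hidden layer followed by a linear output layer."""
    h_1 = np.maximum(0.0, beta_0 + Omega_0 @ x)   # hidden activations, (D_1, 1)
    return beta_1 + Omega_1 @ h_1                 # outputs, (D_o, 1)

rng = np.random.default_rng(2)
D_i, D_1, D_o = 2, 3, 2   # hypothetical sizes
x = rng.standard_normal((D_i, 1))
beta_0, Omega_0 = rng.standard_normal((D_1, 1)), rng.standard_normal((D_1, D_i))
beta_1, Omega_1 = rng.standard_normal((D_o, 1)), rng.standard_normal((D_o, D_1))

print(shallow_forward(x, beta_0, Omega_0, beta_1, Omega_1).shape)  # (2, 1)
```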
Number of Parameters from Hidden Layer to Output Layer
The book denotes the dimension of the output as:
$$ D_o $$
Here $o$ stands for output.
- $\biasvec_K$ is the bias vector of dimension $D_o \times 1$.
- $\weightmat_K$ is the weight matrix of dimension $D_o \times D_K$.
Thus, the number of parameters in the output layer is:
$$ (D_K + 1) \cdot D_o $$
Total Number of Parameters in Shallow NN
- Input layer to hidden layer: $(D_i + 1) \cdot D_1$
- Hidden layer to output layer: $(D_1 + 1) \cdot D_o$
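The total is the sum of the two: $(D_i + 1) \cdot D_1 + (D_1 + 1) \cdot D_o$. A small sketch that computes this count (the sizes in the example call are hypothetical):

```python
def shallow_param_count(D_i, D_1, D_o):
    """Total parameters of a shallow network: input-to-hidden plus hidden-to-output."""
    return (D_i + 1) * D_1 + (D_1 + 1) * D_o

print(shallow_param_count(D_i=2, D_1=3, D_o=2))  # (2+1)*3 + (3+1)*2 = 17
```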
Overview of Shallow Neural Networks
Which hidden units are activated in a given region?
Look at the grey region in the figure. The activated hidden units for this region are $h_1$ and $h_3$.
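Since the figure itself is not reproduced here, a small sketch of the same idea: with a univariate input and three hidden units (hand-picked, arbitrary parameters), the units active at a point are those whose pre-activation is positive, and this pattern is constant within each linear region.

```python
import numpy as np

# Three hidden units with arbitrary, hand-picked parameters (for illustration only).
beta_0 = np.array([-1.0, 0.5, -0.2])
omega_0 = np.array([1.0, -1.0, 2.0])

def active_units(x):
    """Boolean mask of which hidden units are active at scalar input x."""
    return (beta_0 + omega_0 * x) > 0

print(active_units(0.0))   # [False  True False]
print(active_units(1.5))   # [ True False  True]
```

In this toy example, at $x = 1.5$ the active units are $h_1$ and $h_3$, the same kind of pattern described above for the grey region.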
Universal Approximation Theorem
Remember: more hidden units $\Rightarrow$ more ReLU joints $\Rightarrow$ more linear regions (flexibility).
Therefore, the universal approximation theorem states that with enough hidden units, a shallow neural network can approximate any continuous function to arbitrary precision.
So, in theory, increasing the width of a shallow network is enough to approximate any continuous function.
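As an informal illustration only (not a proof of the theorem): in the sketch below the hidden-layer parameters are drawn at random and only the output layer is fitted to $\sin(x)$ by least squares; the maximum fit error tends to shrink as the width $D$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-np.pi, np.pi, 200)
target = np.sin(x)

def fit_error(D):
    """Max error when fitting sin(x) with a width-D shallow ReLU net (random hidden layer)."""
    omega_0 = rng.standard_normal(D)
    beta_0 = rng.uniform(-np.pi, np.pi, D)
    H = np.maximum(0.0, np.outer(x, omega_0) + beta_0)   # hidden activations, (200, D)
    H = np.hstack([H, np.ones((x.size, 1))])             # extra column for the output bias
    coeffs, *_ = np.linalg.lstsq(H, target, rcond=None)  # fit output weights/bias only
    return np.max(np.abs(H @ coeffs - target))

for D in (3, 10, 100):
    print(D, fit_error(D))   # the error typically decreases as D grows
```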