Shallow Neural Networks

[Figure: Shallow Terminology]

Shallow neural networks (with ReLU activations) represent piecewise linear functions.

By shallow, we mean that they have only one hidden layer.

However, given enough hidden units, shallow networks can approximate arbitrary relationships between input and output to any desired degree of accuracy.

Shallow networks also support multi-dimensional input and output.

Table of contents
  1. Limitations of Regression
  2. From Input Layer to Hidden Layer
    1. Multi-Dimensional Input
    2. Number of Parameters from Input Layer to Hidden Layer
  3. Activation Function
    1. Rectified Linear Unit (ReLU)
    2. Other Activation Functions
    3. Purpose of Activation Functions
  4. Hidden Layer
    1. Shallow
    2. Hidden Units
  5. To Output Layer
    1. Multi-Dimensional Output
    2. Number of Parameters from Hidden Layer to Output Layer
  6. Total Number of Parameters in Shallow NN
  7. Overview of Shallow Neural Networks
  8. Universal Approximation Theorem

Limitations of Regression

  • Linear regression can only describe linear relationships between input and output.
  • It does not support multi-dimensional output.
\[y = f[x, \phi] = \phi_0 + \phi_1 x\]

From Input Layer to Hidden Layer

A bias unit is added to the input layer, and each unit in the input layer is weighted by a parameter and fed into the hidden layer.

We call the following the pre-activation:

$$ \biasvec_0 + \weightmat_0 \inputx $$

Multi-Dimensional Input

When the input is univariate, the network represents a piecewise linear function of one variable.

If there are two input features, the network represents a piecewise planar surface.

Number of Parameters from Input Layer to Hidden Layer

The book denotes the dimension of the input as:

$$ D_i $$

Here $i$ stands for input.

  • $\biasvec_0$ is the bias vector of dimension $D \times 1$, where $D$ is the number of hidden units.
  • $\weightmat_0$ is the weight matrix of dimension $D \times D_i$.

Thus, the number of parameters from the input layer to the hidden layer is:

$$ (D_i + 1) \cdot D $$
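
Here is a minimal NumPy sketch of the shapes involved; the dimensions ($D_i = 3$, $D = 4$) and the random parameter values are assumptions for illustration only:

```python
import numpy as np

# Assumed illustrative dimensions: D_i = 3 inputs, D = 4 hidden units.
D_i, D = 3, 4

rng = np.random.default_rng(0)
beta_0 = rng.standard_normal((D, 1))     # bias vector, D x 1
Omega_0 = rng.standard_normal((D, D_i))  # weight matrix, D x D_i
x = rng.standard_normal((D_i, 1))        # input, D_i x 1

pre_activation = beta_0 + Omega_0 @ x    # pre-activations, D x 1
print(pre_activation.shape)              # (4, 1)

# Parameter count from input layer to hidden layer: (D_i + 1) * D
print(beta_0.size + Omega_0.size, (D_i + 1) * D)  # 16 16
```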


Activation Function

The book denotes the activation function as:

$$ \mathbf{a}[\bullet] $$

Rectified Linear Unit (ReLU)

One of the most common types of activation function is the ReLU.

For univariate pre-activation $z$:

$$ a[z] = \text{ReLU}(z) = \max(0, z) $$

[Figure: ReLU]

ReLU zeroes out negative values. Applied element-wise, it extends to multi-dimensional pre-activations as well.
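
A minimal NumPy sketch of ReLU applied element-wise (the example values are arbitrary):

```python
import numpy as np

def relu(z):
    """Element-wise ReLU: max(0, z), i.e., negative values are zeroed out."""
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))  # [0.  0.  0.  1.5 3. ]
```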

Other Activation Functions

[Figure: Shallow Activation]

Purpose of Activation Functions

ReLU essentially acts as a gate on each hidden unit: a nonnegative pre-activation passes through (the unit's contribution to the network is activated), while a negative one is blocked (the unit is deactivated).

By turning on and off some parts of the computation, activation functions introduce non-linearity to the model.

[Figure: Shallow Activation Example]

The figures above show the pre-activations and the corresponding activations of each hidden unit.

The activations will then be weighted and combined to form the output units.

Without the activation function, the network would collapse to a linear model, since a composition of linear (affine) maps is itself linear (affine).
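
A minimal sketch of this collapse (dimensions and random parameters are assumed for illustration): composing two affine layers with no activation in between is equivalent to a single affine map.

```python
import numpy as np

rng = np.random.default_rng(1)
D_i, D, D_o = 3, 4, 2

# Two affine layers with no activation in between (assumed random parameters).
beta_0, Omega_0 = rng.standard_normal((D, 1)), rng.standard_normal((D, D_i))
beta_1, Omega_1 = rng.standard_normal((D_o, 1)), rng.standard_normal((D_o, D))

x = rng.standard_normal((D_i, 1))

# Layer-by-layer computation without an activation function.
y_stacked = beta_1 + Omega_1 @ (beta_0 + Omega_0 @ x)

# Equivalent single affine map: beta' = beta_1 + Omega_1 beta_0, Omega' = Omega_1 Omega_0.
beta_prime = beta_1 + Omega_1 @ beta_0
Omega_prime = Omega_1 @ Omega_0
y_collapsed = beta_prime + Omega_prime @ x

print(np.allclose(y_stacked, y_collapsed))  # True
```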


Hidden Layer

A hidden layer is a layer between the input and output layers.

In each hidden layer, there are hidden units ($h_1, h_2, h_3$ in the figure).

Do not confuse hidden unit notation $h_d$ with hidden layer notation $\hidden_k$.

Each hidden unit applies an activation function to its pre-activation.

[Figure: Multi-IO Shallow NN]

The book uses $K$ to denote the number of hidden layers, and $D_k$ to denote the number of hidden units in the $k$-th layer.

$$ \hidden_k = \activate \left[ \biasvec_{k-1} + \weightmat_{k-1} \hidden_{k-1} \right], \qquad \hidden_0 = \inputx $$

Shallow

By shallow, we mean that the network has only one hidden layer.

In a shallow network, $K = 1$,

\[\hidden_1 = \activate \left[ \biasvec_0 + \weightmat_0 \inputx \right]\]
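In code, the shallow hidden layer is just a matrix multiply, a bias addition, and an element-wise ReLU; the dimensions and random parameters below are assumptions for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(2)
D_i, D_1 = 2, 3

beta_0 = rng.standard_normal((D_1, 1))     # bias for the (only) hidden layer
Omega_0 = rng.standard_normal((D_1, D_i))  # weights for the hidden layer
x = rng.standard_normal((D_i, 1))

h_1 = relu(beta_0 + Omega_0 @ x)           # hidden-layer activations, D_1 x 1
print(h_1.shape)  # (3, 1)
```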

Hidden Units

The book denotes the number of hidden units in the $k$-th layer as:

$$ D_k $$

The number of hidden units (the width of the network) is sometimes referred to as its capacity.


To Output Layer

Similar to how the input units were weighted, combined, and activated, the results of the hidden units are again weighted and combined to form the output units.

The only difference is that the output units do not go through an activation function.

The output layer is:

$$ \outputy = \biasvec_K + \weightmat_K \hidden_K $$

In a shallow network,

\[\outputy = \biasvec_1 + \weightmat_1 \hidden_1\]
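
Putting both steps together, here is a minimal sketch of the full shallow forward pass; the helper name `shallow_forward`, the dimensions, and the random parameters are assumptions for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def shallow_forward(x, beta_0, Omega_0, beta_1, Omega_1):
    """Shallow network: one hidden layer with ReLU, then a linear output layer."""
    h_1 = relu(beta_0 + Omega_0 @ x)   # hidden layer
    return beta_1 + Omega_1 @ h_1      # output layer (no activation)

rng = np.random.default_rng(3)
D_i, D_1, D_o = 2, 4, 3
beta_0, Omega_0 = rng.standard_normal((D_1, 1)), rng.standard_normal((D_1, D_i))
beta_1, Omega_1 = rng.standard_normal((D_o, 1)), rng.standard_normal((D_o, D_1))

x = rng.standard_normal((D_i, 1))
y = shallow_forward(x, beta_0, Omega_0, beta_1, Omega_1)
print(y.shape)  # (3, 1)
```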

Multi-Dimensional Output

When the output is univariate, the network represents a single piecewise linear function (a piecewise line, plane, and so on, depending on the input dimension).

If there is more than one output, the network represents a collection of piecewise linear functions, one per output dimension.

Each of these functions has the same joints and regions as in the single-output case; however, their slopes and offsets differ because each output has its own weights and bias.
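
A small sketch of this point (1D input, two outputs, random parameters assumed): every output dimension kinks at the same input locations, namely where each hidden unit's pre-activation crosses zero.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(4)
D_1, D_o = 3, 2                     # 3 hidden units, 2 outputs, 1D input

beta_0 = rng.standard_normal((D_1, 1))
Omega_0 = rng.standard_normal((D_1, 1))
beta_1 = rng.standard_normal((D_o, 1))
Omega_1 = rng.standard_normal((D_o, D_1))

x = np.linspace(-5, 5, 2001).reshape(1, -1)   # 1D input grid
h_1 = relu(beta_0 + Omega_0 @ x)              # D_1 x N hidden activations
y = beta_1 + Omega_1 @ h_1                    # D_o x N outputs, piecewise linear in x
print(y.shape)                                # (2, 2001)

# Every output dimension shares the same joints: the points where a
# hidden unit's pre-activation beta_0 + Omega_0 * x crosses zero.
joints = np.sort((-beta_0 / Omega_0).ravel())
print(joints)
```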

Number of Parameters from Hidden Layer to Output Layer

The book denotes the dimension of the output as:

$$ D_o $$

Here $o$ stands for output.

  • $\biasvec_K$ is the bias vector of dimension $D_o \times 1$.
  • $\weightmat_K$ is the weight matrix of dimension $D_o \times D_K$, where $D_K$ is the number of hidden units in the last hidden layer.

Thus, the number of parameters from the hidden layer to the output layer is:

$$ (D_K + 1) \cdot D_o $$


Total Number of Parameters in Shallow NN

  • Input layer to hidden layer: $(D_i + 1) \cdot D_1$
  • Hidden layer to output layer: $(D_1 + 1) \cdot D_o$
  • Total: $(D_i + 1) \cdot D_1 + (D_1 + 1) \cdot D_o$
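
As a small sketch of the count (the helper name `shallow_param_count` is made up for this illustration):

```python
def shallow_param_count(D_i, D_1, D_o):
    """Total parameters of a shallow network:
    input -> hidden: (D_i + 1) * D_1, plus hidden -> output: (D_1 + 1) * D_o."""
    return (D_i + 1) * D_1 + (D_1 + 1) * D_o

# Example: 3 inputs, 4 hidden units, 2 outputs.
print(shallow_param_count(3, 4, 2))  # (3 + 1) * 4 + (4 + 1) * 2 = 26
```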

Overview of Shallow Neural Networks

[Figure: Shallow Build Up]
Which hidden units are activated for the region?

Look at the grey region in the figure. The activated hidden units for this region are $h_1$ and $h_3$.


Universal Approximation Theorem

Remember that more hidden units $\Rightarrow$ more activations $\Rightarrow$ more joints, and therefore more linear regions (more flexibility).

[Figure: Shallow Approximation]

Therefore, the universal approximation theorem states that, with enough hidden units, a shallow neural network can approximate any continuous function (on a bounded input domain) to arbitrary precision.

So, in theory, increasing the width of the network is enough to approximate any such function, although the number of hidden units required may be impractically large.
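
As a concrete (if hand-crafted) sketch of this idea, the code below builds a shallow ReLU network that linearly interpolates $\sin$ at $D_1 + 1$ equally spaced knots; the maximum error on $[0, 2\pi]$ shrinks as the width $D_1$ grows. The helper names and the interpolation construction are assumptions for illustration, not the book's method.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def fit_shallow_interpolant(f, a, b, D_1):
    """Hand-construct a shallow ReLU net (D_1 hidden units) that linearly
    interpolates f at D_1 + 1 equally spaced knots on [a, b]."""
    t = np.linspace(a, b, D_1 + 1)                  # knots
    m = np.diff(f(t)) / np.diff(t)                  # slope of each segment
    # Hidden layer: h_j(x) = ReLU(x - t_j) for j = 0, ..., D_1 - 1.
    beta_0, Omega_0 = -t[:-1].reshape(-1, 1), np.ones((D_1, 1))
    # Output layer: f(t_0) + m_0 * h_0 + sum_j (m_j - m_{j-1}) * h_j.
    Omega_1 = np.concatenate(([m[0]], np.diff(m))).reshape(1, -1)
    beta_1 = np.array([[f(t[0])]])
    return beta_0, Omega_0, beta_1, Omega_1

def shallow_forward(x, beta_0, Omega_0, beta_1, Omega_1):
    return beta_1 + Omega_1 @ relu(beta_0 + Omega_0 @ x)

target = np.sin
x = np.linspace(0, 2 * np.pi, 10_001).reshape(1, -1)
for D_1 in (4, 8, 16, 32):
    params = fit_shallow_interpolant(target, 0, 2 * np.pi, D_1)
    err = np.max(np.abs(shallow_forward(x, *params) - target(x)))
    print(D_1, f"{err:.4f}")   # max error shrinks as the width D_1 grows
```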