Deep Neural Networks
Network Composition
Let’s take a look at a composition of two single-input, single-output shallow networks.
Define the first network as $f_1$ and the second network as $f_2$.
Then we can say $y’ = f_2(f_1(x))$.
Folding Analogy
Take Network 1 above as a piece of paper. If you were to fold the paper whenever there is a joint along the input axis, you would end up with three regions.
With the paper already folded, in each region, you are going to stretch, flip and scale Network 2 onto it to fit the size of the region.
Then whenever there is a joint on Network 2, you’re going to fold the paper again.
So you start with three regions from Network 1, and each of them gets folded again into three by Network 2's joints, giving 9 regions in total.
Then for the final output, you unfold the paper and you have the composed network.
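This counting can be checked numerically. The sketch below assumes ReLU activations and uses arbitrary weights chosen so that each shallow network produces three linear regions over the input range $[0, 1]$; none of these values come from the original figures. It counts regions by tracking which hidden units are active as the input sweeps across the range.

```python
import numpy as np

# Numerical sketch of the folding argument. The weights are arbitrary values
# chosen so each shallow ReLU network creates three linear regions over [0, 1].

def shallow(x, theta, beta, phi, phi0):
    """Single-input, single-output shallow network with ReLU hidden units."""
    return phi0 + phi @ np.maximum(0.0, beta + theta * x)

def pattern(z, theta, beta):
    """On/off pattern of the hidden units for input z."""
    return tuple(beta + theta * z > 0)

# Network 1: three hidden units, joints at x = 0.35 and x = 0.70 inside [0, 1].
t1, b1 = np.array([1.0, 1.0, 1.0]), np.array([0.05, -0.35, -0.70])
p1, p10 = np.array([1.0, -2.0, 2.0]), 0.05
# Network 2: three hidden units; its linear read-out does not move any joints,
# so only its hidden-layer parameters matter when counting regions.
t2, b2 = np.array([1.0, 1.0, 1.0]), np.array([-0.20, -0.30, -0.50])

xs = np.linspace(0.0, 1.0, 100_001)
pats1 = [pattern(x, t1, b1) for x in xs]
pats12 = [pattern(x, t1, b1) + pattern(shallow(x, t1, b1, p1, p10), t2, b2)
          for x in xs]

def count_regions(pats):
    """Each linear region has its own activation pattern, so counting pattern
    changes along the input axis counts the regions."""
    return 1 + sum(a != b for a, b in zip(pats, pats[1:]))

print("regions of f1:       ", count_regions(pats1))    # 3 with these weights
print("regions of f2(f1(x)):", count_regions(pats12))   # 9 = 3 x 3, as in the folding analogy
```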
Deep Neural Network
Instead of composition, we can have a single network with multiple layers and achieve the same result.
For example, the composition of the two shallow networks above can be rewritten as a single network with two hidden layers.
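To make this concrete, here is a sketch of the algebra, written in the layer notation defined in the next section; the primed symbols for the second network's parameters are introduced here only for this derivation. Write Network 1 as $\outputy_1 = \biasvec_1 + \weightmat_1 \hidden_1$ with $\hidden_1 = \activate[\biasvec_0 + \weightmat_0 \inputx]$, and Network 2 the same way with primed parameters and $\outputy_1$ as its input. Substituting one into the other gives

$$ \hidden_2 = \activate[\biasvec'_0 + \weightmat'_0 \outputy_1] = \activate[(\biasvec'_0 + \weightmat'_0 \biasvec_1) + (\weightmat'_0 \weightmat_1) \hidden_1], $$

so the composition is exactly a network with two hidden layers $\hidden_1$ and $\hidden_2$, followed by the output $\outputy = \biasvec'_1 + \weightmat'_1 \hidden_2$.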
Notations
Organizing the notations from shallow neural networks:
- $K$: Depth of the network (number of hidden layers)
- $D_k$: Width of layer $k$ (number of hidden units in layer $k$)
$K$ and $D_k$ are hyperparameters.
- $\inputx$: Input vector of dimension $D_i$
The first and $k$-th hidden layers are computed as:
$$ \begin{align*} \hidden_1 &= \activate[\biasvec_0 + \weightmat_0 \inputx] \\[1em] \hidden_k &= \activate[\biasvec_{k-1} + \weightmat_{k-1} \hidden_{k-1}] \end{align*} $$
- $\outputy$: Output vector of dimension $D_o$, computed from the last hidden layer $\hidden_K$:
$$ \outputy = \biasvec_K + \weightmat_K \hidden_K $$
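As a concrete illustration, here is a minimal NumPy sketch of this forward pass. The use of ReLU for $\activate$ and the particular layer widths are arbitrary assumptions made just for the example:

```python
import numpy as np

# Minimal sketch of the deep-network forward pass defined above.

def relu(z):
    return np.maximum(0.0, z)

def forward(x, biases, weights):
    """biases/weights hold beta_0..beta_K and Omega_0..Omega_K."""
    h = relu(biases[0] + weights[0] @ x)            # first hidden layer
    for beta, omega in zip(biases[1:-1], weights[1:-1]):
        h = relu(beta + omega @ h)                  # k-th hidden layer
    return biases[-1] + weights[-1] @ h             # linear output layer

rng = np.random.default_rng(0)
D_i, widths, D_o = 4, [5, 3], 2                     # K = 2 hidden layers
dims = [D_i] + widths + [D_o]
weights = [rng.standard_normal((d_out, d_in))
           for d_in, d_out in zip(dims[:-1], dims[1:])]
biases = [rng.standard_normal(d_out) for d_out in dims[1:]]

x = rng.standard_normal(D_i)
print(forward(x, biases, weights))                  # output y of dimension D_o
```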
Capacity of the Network
The capacity of the network is the total number of hidden units:
$$ \sum_{k=1}^{K} D_k $$
Comparison to Shallow Networks
Both Can Approximate Any Function
A deep network can mimic a shallow one by making every hidden layer after the first act as an identity function.
So both shallow and deep networks obey the universal approximation theorem.
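One way to see this, as a sketch assuming ReLU activations (this step is not spelled out above): every entry of $\hidden_1$ is non-negative because it is the output of the activation, so each later hidden layer can pass it through unchanged by using an identity weight matrix and a zero bias,

$$ \hidden_{k+1} = \activate[\mathbf{0} + \mathrm{I}\,\hidden_k] = \hidden_k, $$

and the output layer of the deep network can then copy the shallow network's output layer.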
Deep Networks Have More Representational Power
For the same number of hidden units, deep networks have more representational power: they have more parameters and can create many more linear regions.
For a single-input, single-output network, putting 6 hidden units in a single hidden layer creates $12 + 7 = 19$ parameters (12 weights and 7 biases).
Splitting the 6 hidden units into 2 hidden layers with 3 hidden units each creates $15 + 7 = 22$ parameters.
In terms of linear regions, refer back to the folding analogy above.
The number of regions gets multiplied with each additional layer, while adding hidden units to a shallow network only increases the number of regions additively. Therefore, deep networks generally create more linear regions.
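The parameter counts above can be reproduced with a small helper (an illustrative sketch, not code from the text) that counts weights and biases for any fully connected architecture:

```python
# Count parameters of a fully connected network from its layer widths,
# e.g. [1, 6, 1] is one hidden layer of 6 units, [1, 3, 3, 1] is two layers of 3.

def param_count(widths):
    weights = sum(d_in * d_out for d_in, d_out in zip(widths[:-1], widths[1:]))
    biases = sum(widths[1:])          # every non-input unit has a bias
    return weights + biases

print(param_count([1, 6, 1]))         # 19 for the shallow network
print(param_count([1, 3, 3, 1]))      # 22 for the deep network
```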
Dependencies and Symmetries Among Regions
The layers of a deep network depend on each other sequentially, and the folding analogy suggests that the resulting regions contain symmetries: they are folded copies of the same underlying function.
So even though deep networks create more linear regions, we cannot always say this is desirable.
Whether we want this effect depends on the nature of the problem.
Depth Efficiency
We said that both shallow and deep networks can approximate any function.
However, some functions (not all) require exponentially more hidden units to approximate with a shallow network than with a deep one.
Some tasks can be done with a few hundred hidden units in multiple layers, but require millions of hidden units in a single layer.
Again, whether this effect is desirable depends on the nature of the problem.
Large, Structured Inputs
Images, when flattened into a single vector, can easily have millions of input dimensions.
Even a single fully connected layer applied to such an input is a serious computational burden.
Also, in the case of images, we don’t always want to treat each pixel as an independent feature. Sometimes we want to lump together pixels and process them in local groups.
Deep networks such as convolutional neural networks (CNNs) can handle this by passing the input through multiple convolutional and pooling layers.
It is not feasible to do this with a single layer.
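For a rough sense of scale, here is an illustrative calculation (the image size, layer width, and kernel size are assumptions chosen for the example, not values from the text):

```python
# Illustrative comparison: a fully connected layer on a flattened RGB image
# versus a small convolutional layer that shares weights across locations.

H, W, C = 224, 224, 3                      # a modest-sized RGB image
flat_dim = H * W * C                       # 150,528 input dimensions

hidden = 1000
fc_params = flat_dim * hidden + hidden     # weights + biases
print(f"fully connected layer: {fc_params:,} parameters")    # ~150 million

k, out_channels = 3, 64                    # 3x3 kernels, 64 output channels
conv_params = k * k * C * out_channels + out_channels
print(f"3x3 conv layer:        {conv_params:,} parameters")  # 1,792
```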
Training and Generalization
Up to a certain depth, deep networks are easier to train than shallow ones. However, as more and more layers are added, training deep networks becomes more difficult.
Another advantage of deep networks is that they tend to generalize better to unseen data than shallow networks, although we don't really know why this is the case.