Deep Neural Networks

Table of contents
  1. Network Composition
    1. Folding Analogy
  2. Deep Neural Network
    1. Notations
    2. Capacity of the Network
  3. Comparison to Shallow Networks
    1. Both Can Approximate Any Function
    2. Deep Networks Have More Representational Power
    3. Dependencies and Symmetries Among Regions
    4. Depth Efficiency
    5. Large, Structured Inputs
    6. Training and Generalization

Network Composition

Let’s take a look at a composition of two single-input, single-output shallow networks.

[Figure: Deep Concat]

Define the first network as $f_1$ and the second network as $f_2$.

Then we can say $y' = f_2(f_1(x))$.
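
As a concrete illustration, here is a minimal NumPy sketch of this composition. The `shallow_net` helper and the parameter values are placeholders of my own choosing, not the networks shown above:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def shallow_net(x, theta, phi):
    """Single-input, single-output shallow ReLU network.

    theta: (D, 2) array, row j holds the [offset, slope] of hidden unit j.
    phi:   (D + 1,) array, output [bias, weight_1, ..., weight_D].
    """
    h = relu(theta[:, 0] + theta[:, 1] * x)   # hidden unit activations
    return phi[0] + phi[1:] @ h               # linear combination -> scalar output

# Arbitrary placeholder parameters for two 3-unit networks.
theta1 = np.array([[0.0, 1.0], [-0.3, 1.0], [-0.6, 1.0]])
phi1   = np.array([0.0, 1.0, -2.0, 2.0])
theta2 = np.array([[0.0, 1.0], [-0.2, 1.0], [-0.5, 1.0]])
phi2   = np.array([0.1, 1.5, -3.0, 3.0])

f1 = lambda x: shallow_net(x, theta1, phi1)
f2 = lambda x: shallow_net(x, theta2, phi2)

y_prime = f2(f1(0.7))   # y' = f2(f1(x)) for x = 0.7
```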

Folding Analogy

[Figure: Deep Fold]

Think of the input axis of Network 1 above as a piece of paper. If you fold the paper at every joint of Network 1, you end up with three regions stacked on top of each other.

With the paper folded, in each region you stretch, flip, and scale Network 2 so that it fits the size of that region.

Then whenever there is a joint on Network 2, you’re going to fold the paper again.

So Network 1 gives three regions, and folding each of them again at every joint of Network 2 gives $3 \times 3 = 9$ regions in total.

Then for the final output, you unfold the paper and you have the final network.
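
To sanity-check the $3 \times 3 = 9$ count, here is a small sketch that composes two hand-built 3-unit zigzag networks (an assumed example, not the networks shown above) and counts the linear regions numerically from slope changes on a dense grid:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def zigzag(v):
    """Hand-built 3-unit shallow ReLU network that maps [0, 1] onto [0, 1]
    three times: up on [0, 1/3], down on [1/3, 2/3], up on [2/3, 1]."""
    return 3 * relu(v) - 6 * relu(v - 1/3) + 6 * relu(v - 2/3)

# Composed network y' = f2(f1(x)), here with f1 = f2 = zigzag.
x = np.linspace(0.0, 1.0, 2001)
y = zigzag(zigzag(x))

# Count linear regions by detecting changes in the numerical slope.
slopes = np.diff(y) / np.diff(x)
change_idx = np.flatnonzero(np.abs(np.diff(slopes)) > 1e-6)
# A joint inside a grid cell shows up as two adjacent changes; merge them.
n_joints = 0 if change_idx.size == 0 else 1 + int(np.sum(np.diff(change_idx) > 1))
print(n_joints + 1)   # 9 linear regions = 3 regions x 3 folds
```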


Deep Neural Network

Instead of composing two networks, we can build a single network with multiple hidden layers that computes the same function.

For example:

[Figure: Deep Equiv]

Notations

[Figure: Deep K Layer]

Organizing the notations from shallow neural networks:

  • $K$: Depth of the network (number of hidden layers)
  • $D_k$: Width of layer $k$ (number of hidden units in layer $k$)

    $K$ and $D_k$ are hyperparameters.

  • $\inputx$: Input vector, of dimension $D_i$
  • First and $k$-th hidden layer

    $$ \begin{align*} \hidden_1 &= \activate[\biasvec_0 + \weightmat_0 \inputx] \\[1em] \hidden_k &= \activate[\biasvec_{k-1} + \weightmat_{k-1} \hidden_{k-1}] \end{align*} $$

  • $\outputy$: Output vector, of dimension $D_o$

    $$ \outputy = \biasvec_K + \weightmat_K \hidden_K $$

Capacity of the Network

The capacity of the network is the total number of hidden units:

$$ \sum_{k=1}^{K} D_k $$
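
Putting the notation together, here is a minimal NumPy sketch of the forward pass defined above; the layer sizes are arbitrary placeholders, and the last line computes the capacity $\sum_k D_k$:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

D_i, D_o = 4, 2           # input and output dimensions
widths = [5, 5, 3]        # D_1 .. D_K (here K = 3 hidden layers)

# One weight matrix and bias vector per layer: Omega_0..Omega_K, beta_0..beta_K.
dims = [D_i] + widths + [D_o]
Omegas = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(len(dims) - 1)]
betas  = [rng.normal(size=dims[k + 1])            for k in range(len(dims) - 1)]

def deep_net(x):
    h = x
    for k in range(len(widths)):              # h_1 .. h_K, each with the activation
        h = relu(betas[k] + Omegas[k] @ h)
    return betas[-1] + Omegas[-1] @ h         # linear output layer

y = deep_net(rng.normal(size=D_i))
print(y.shape)        # (2,) -> dimension D_o
print(sum(widths))    # 13 -> capacity, the total number of hidden units
```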


Comparison to Shallow Networks

Both Can Approximate Any Function

A deep network can mimic a shallow network by making every hidden layer except the first compute the identity function.

So both shallow and deep networks obey the universal approximation theorem.
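
As a quick check of this claim, the sketch below (my own construction, not from the source) makes the second layer of a two-layer ReLU network an exact identity: because the first layer's outputs are non-negative, the extra ReLU changes nothing, so the deep network reproduces the shallow one exactly.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(1)
D = 4
W0, b0 = rng.normal(size=(D, 1)), rng.normal(size=D)   # shallow network's hidden layer
w1, b1 = rng.normal(size=D), rng.normal()               # shallow network's output layer

def shallow(x):
    return b1 + w1 @ relu(b0 + W0 @ x)

def deep_with_identity_layer(x):
    h1 = relu(b0 + W0 @ x)
    # Second layer: identity weights, zero biases. Since h1 >= 0, the ReLU
    # leaves it unchanged and the extra layer has no effect.
    h2 = relu(np.zeros(D) + np.eye(D) @ h1)
    return b1 + w1 @ h2

x = rng.normal(size=1)
print(np.allclose(shallow(x), deep_with_identity_layer(x)))   # True
```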

Deep Networks Have More Representational Power

For the same total number of hidden units, deep networks have more representational power: they have more parameters and can create more linear regions.

For a network with a single input and output, putting all 6 hidden units in one hidden layer creates $12 + 7 = 19$ parameters (6 weights and 6 biases into the hidden layer, then 6 weights and 1 bias for the output).

Splitting the 6 hidden units into two hidden layers of 3 units each, built as the composition of two 3-unit shallow networks as above, creates $(6 + 4) + (6 + 4) = 20$ parameters. (A fully general two-layer network with a dense $3 \times 3$ hidden-to-hidden weight matrix would have $6 + 12 + 4 = 22$.)
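
These counts are easy to verify with a small helper (a sketch of my own, assuming dense fully connected layers and a scalar input and output):

```python
def n_params(widths):
    """Parameter count of a dense fully connected net with scalar input/output."""
    dims = [1] + list(widths) + [1]
    return sum(dims[k] * dims[k + 1] + dims[k + 1] for k in range(len(dims) - 1))

print(n_params([6]))       # 19 -> shallow network with 6 hidden units
print(2 * n_params([3]))   # 20 -> two composed 3-unit shallow networks
print(n_params([3, 3]))    # 22 -> dense two-layer network with 3 + 3 hidden units
```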

In terms of linear regions, refer back to the folding analogy above.

The number of regions is multiplied with each additional layer, whereas adding hidden units to a shallow network only increases the number of regions additively. Therefore, deep networks generally create more linear regions.

Dependencies and Symmetries Among Regions

The layers of a deep network depend on one another sequentially, and the folding analogy suggests that the regions they create contain symmetries: regions produced by the same fold are mirrored copies of one another.

So although deep networks create more linear regions, these regions are not independent, and more regions are not automatically desirable.

Whether we want this effect depends on the nature of the problem.

Depth Efficiency

We said that both shallow and deep networks can approximate any function.

However, some functions (not all) require exponentially more hidden units to approximate with a shallow network than with a deep one.

Some tasks can be done with a few hundred hidden units in multiple layers, but require millions of hidden units in a single layer.

Again, whether this effect is desirable depends on the nature of the problem.

Large, Structured Inputs

Images, when flattened into a single vector, can easily have millions of input dimensions.

Even a single fully connected layer on such an input creates a real computational burden.

Also, in the case of images, we don’t always want to treat each pixel as an independent feature. Sometimes we want to lump together pixels and process them in local groups.

Deep networks such as convolutional neural networks (CNNs) handle this by passing the input through multiple convolutional and pooling layers.

It is not feasible to do this with a single layer.
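
As a rough, back-of-the-envelope comparison (the image size and layer widths are arbitrary example values, not from the source), a single fully connected layer on a flattened image already needs orders of magnitude more parameters than a small convolutional layer that processes local groups of pixels with shared weights:

```python
# A 224x224 RGB image flattened into a vector (illustrative size).
pixels = 224 * 224 * 3                     # 150,528 input dimensions

# First layer of a fully connected network with 1,024 hidden units.
fc_params = pixels * 1024 + 1024           # ~154 million parameters

# A 3x3 convolution taking 3 channels to 64 channels, shared across all positions.
conv_params = (3 * 3 * 3) * 64 + 64        # 1,792 parameters

print(f"{fc_params:,} vs {conv_params:,}")
```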

Training and Generalization

Up to a certain depth, it is easier to train deep networks than shallow networks. However, as more and more layers are added, training becomes difficult again.

Another advantage of deep networks is that they tend to generalize better to unseen data than shallow networks.

However, we don't really know why this is the case.