Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality reduction technique that finds a few principal components that best summarize the information in a dataset.

We project the $p$-dimensional feature space onto a $k$-dimensional subspace, with $k < p$.

By “best summarize”, we mean that the principal components capture the maximum variability in the data.

It is an unsupervised learning method.

Table of contents
  1. Benefits of PCA
  2. Euclidean Projection
  3. First Principal Component
    1. Features Centered (and Scaled)
    2. Score Vector
    3. Loading Vector
      1. Constraint on Loading Vector
    4. Maximizing Sample Variance
  4. Other Principal Components
  5. Proportion of Variance Explained
  6. Singular Value Decomposition

Benefits of PCA

  • Visualization of high-dimensional data
  • Data compression (efficient representation)
  • Improved generalization
  • Noise removal

Euclidean Projection

We first introduce PCA in terms of projections.

See Orthogonal Projection

We have a Euclidean vector space,

\[(\mathbb{R}^p, \langle \cdot, \cdot \rangle)\]

where the inner product is just the dot product.

We want to project a vector $\boldsymbol{x} \in \mathbb{R}^p$ onto some $k$-dimensional subspace $U \subseteq \mathbb{R}^p$, spanned by orthonormal basis $\{ \mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_k \}$.

The orthonormal basis vectors are the principal components.

The way we select the basis vectors is important, because we want to retain as much information as possible.

The projection of $\boldsymbol{x}$ onto a single unit vector $\mathbf{u}$ is:

\[\pi_{\mathbf{u}}(\boldsymbol{x}) = (\boldsymbol{x}^\top \mathbf{u}) \mathbf{u}\]

The length of the projection, $\boldsymbol{x}^\top \mathbf{u}$, measures how far $\boldsymbol{x}$ extends in the direction of $\mathbf{u}$. In PCA, we call this the score of the principal component.

The projection of $\boldsymbol{x}$ onto $U$ is:

\[\pi_U(\boldsymbol{x}) = \sum_{j=1}^k (\boldsymbol{x}^\top \mathbf{u}_j) \mathbf{u}_j\]

Instead of a single vector $\boldsymbol{x}$, we have a dataset $\mathbf{X} \in \mathbb{R}^{n \times p}$.

Let $\mathbf{U} \in \mathbb{R}^{p \times k}$ be the matrix of basis vectors:

\[\mathbf{U} = \begin{bmatrix} \mathbf{u}_1 & \mathbf{u}_2 & \ldots & \mathbf{u}_k \end{bmatrix}\]

The projection of $\mathbf{X}$ onto $U$ is:

\[\mathbf{X} \mathbf{U} \mathbf{U}^\top\]
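
A minimal NumPy sketch of this projection, using synthetic data and an orthonormal basis built with a QR decomposition (both are illustrative assumptions, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n = 5 samples with p = 3 features.
X = rng.normal(size=(5, 3))

# An orthonormal basis for a k = 2 dimensional subspace of R^p,
# obtained from the QR decomposition of a random p x k matrix.
U, _ = np.linalg.qr(rng.normal(size=(3, 2)))

# Scores: coordinates of each row of X in the basis of U (n x k).
scores = X @ U

# Projection of every row of X onto the subspace spanned by U (n x p).
X_proj = X @ U @ U.T

# Projecting a second time changes nothing: the projection is idempotent.
assert np.allclose(X_proj @ U @ U.T, X_proj)
```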

First Principal Component

The idea is essentially the same for the other principal components, so we start with the first principal component.

Let $X_1, X_2, \ldots, X_p$ be the set of features.

Features Centered (and Scaled)

We assume each feature $X_j$ is centered to have sample mean zero:

$$ \overline{X}_j = \frac{1}{n} \sum_{i=1}^n x_{ij} = 0 $$

If the features are measured on different scales, we typically also scale each feature to have unit standard deviation, so that no single feature dominates the variance.
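
A short NumPy sketch of this preprocessing step (the synthetic data is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # raw feature matrix

# Center each feature to have sample mean zero.
X_centered = X - X.mean(axis=0)

# Optionally scale each feature to unit standard deviation
# when the features are measured in different units.
X_scaled = X_centered / X_centered.std(axis=0)

assert np.allclose(X_centered.mean(axis=0), 0.0)
assert np.allclose(X_scaled.std(axis=0), 1.0)
```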

Score Vector

The first principal component score is the normalized linear combination of the features:

$$ Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \ldots + \phi_{p1}X_p = X \phi_1 $$

This is the coordinate of the data in the direction of the first principal component.

Loading Vector

The loading vector of the principal component is the vector of coefficients $\phi_{j1}$:

\[\phi_1 = \begin{bmatrix} \phi_{11} \\ \phi_{21} \\ \vdots \\ \phi_{p1} \end{bmatrix}\]

This is a measure of how much each feature contributes to the principal component.

Constraint on Loading Vector

Principal components are normalized linear combinations of the original features, i.e. the loading vector $\phi_1$ satisfies the constraint:

$$ \lVert \phi_1 \rVert_2 = 1 $$

The loading vector, drawn from the origin, is constrained to lie on the unit sphere.

Without this constraint, the principal component could have arbitrarily large variance, since $\operatorname{Var}(X(c\phi_1)) = c^2 \operatorname{Var}(X\phi_1)$ for any scalar $c$.

Maximizing Sample Variance

Keep in mind that our features are centered, and so are the scores of the principal components.

As explained in the beginning, the first principal component maximizes the sample variance of the scores:

\[\operatorname{Var}(Z_1) = \frac{1}{n} \sum_{i=1}^n z_{i1}^2 = \frac{1}{n} \lVert X \phi_1 \rVert_2^2 = \frac{1}{n} \phi_1^\top X^\top X \phi_1\]

Together with the constraint on the loading vector,

$$ \phi_1 = \arg \max_{\phi} \phi^\top X^\top X \phi \quad \text{s.t.} \quad \lVert \phi \rVert_2 = 1 $$
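
One standard way to solve this constrained maximization is through the eigendecomposition of $X^\top X$: the maximizer is the eigenvector with the largest eigenvalue. A minimal NumPy sketch on synthetic data (the mixing matrix below is just an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic centered data: n = 200 samples, p = 3 correlated features.
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.3]])
X = X - X.mean(axis=0)

# The maximizer of phi^T X^T X phi subject to ||phi||_2 = 1 is the
# eigenvector of X^T X with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
phi_1 = eigvecs[:, np.argmax(eigvals)]   # first loading vector
z_1 = X @ phi_1                          # first principal component scores

# The achieved variance equals the largest eigenvalue divided by n.
assert np.allclose(z_1.var(), eigvals.max() / X.shape[0])
```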


Other Principal Components

Each subsequent principal component must be orthogonal to all of the previous ones.

So we add additional constraints to the optimization problem:

$$ \phi_k = \arg \max_{\phi} \phi^\top X^\top X \phi \quad \text{s.t.} \quad \lVert \phi \rVert_2 = 1 \quad \land \quad \phi^\top \phi_j = 0 \quad \text{for } j < k $$

Assuming $n > p$, there are at most $p$ principal components.
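
Continuing the sketch above, all loading vectors can be obtained at once as the eigenvectors of $X^\top X$, whose mutual orthogonality automatically satisfies the added constraints (again a NumPy illustration on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)

# The loading vectors are the eigenvectors of X^T X,
# ordered by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order]   # p x p, columns are phi_1, ..., phi_p
scores = X @ loadings          # n x p matrix of principal component scores

# The loading vectors are orthonormal, so phi_k^T phi_j = 0 for j != k.
assert np.allclose(loadings.T @ loadings, np.eye(X.shape[1]))
```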


Proportion of Variance Explained

So we can have at most $p$ principal components, but most of the time, we try to reduce the dimensionality of the data.

But how do we know how many principal components to have? We can’t rely on things like cross-validation because PCA is unsupervised.

This is where the proportion of variance explained (PVE) comes in.

Again, remember that our features are centered, so our variance is the sum of squares divided by $n$.

For each principal component $k$, we can calculate the proportion of variance explained: the variance explained by the $k$-th principal component divided by the total variance:

$$ \text{PVE}_k = \frac{\phi_k^\top X^\top X \phi_k}{\sum_{j=1}^p \phi_j^\top X^\top X \phi_j} $$

We can either plot the PVE or the cumulative PVE to decide how many principal components to keep.
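
A minimal NumPy sketch of computing the PVE and cumulative PVE (synthetic data; the 90% threshold is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = X - X.mean(axis=0)

# Variance explained per component: eigenvalues of X^T X divided by n.
eigvals = np.linalg.eigvalsh(X.T @ X)
var_explained = np.sort(eigvals)[::-1] / X.shape[0]

pve = var_explained / var_explained.sum()   # proportion of variance explained
cumulative_pve = np.cumsum(pve)

print("PVE:           ", np.round(pve, 3))
print("cumulative PVE:", np.round(cumulative_pve, 3))

# One common heuristic: keep enough components to reach 90% cumulative PVE.
k = int(np.searchsorted(cumulative_pve, 0.90) + 1)
print("components needed for 90% of the variance:", k)
```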


Singular Value Decomposition

In practice, we use the singular value decomposition (SVD) to compute the normalized principal component scores, loading vectors, and the amount of variance explained.

Given the data matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$,

\[X = UDV^\top\]

  • Left singular vectors $U \in \mathbb{R}^{n \times p}$ are the normalized principal component scores.

    By normalized, we mean that each score vector is scaled to unit length; the raw score vectors are recovered as the columns of $\mathbf{U}\mathbf{D}$.

  • Each singular value $\sigma_j$ on the diagonal of $D \in \mathbb{R}^{p \times p}$, squared and divided by $n$ (i.e. $\sigma_j^2 / n$), is the variance explained by the $j$-th principal component.
  • Right singular vectors $V \in \mathbb{R}^{p \times p}$ are the loading vectors.
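
A minimal NumPy sketch connecting the SVD back to the quantities above (synthetic data; the thin SVD is used so the shapes match the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)

# Thin SVD: U is n x p, d holds the p singular values, Vt is p x p.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

loadings = Vt.T              # columns are the loading vectors phi_j
scores = X @ loadings        # principal component scores, equal to U @ diag(d)
var_explained = d**2 / n     # variance explained by each component

# The columns of U are the score vectors scaled to unit length.
assert np.allclose(U, scores / np.linalg.norm(scores, axis=0))
# The per-component variances match sigma_j^2 / n.
assert np.allclose(scores.var(axis=0), var_explained)
```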