4.1. Principal Component Analysis

The underlying assumption in principal component analysis (PCA) is that the data is ‘concentrated’ in an affine subspace of dimension \(d\) of the feature space (of dimension \(n\), with \(d<n\)). We model our data with the random vector \(\v X\) with

\[\v\mu = E(\v X), \quad \Sigma = \Cov(\v X)\]

To turn the affine subspace into a linear subspace (one through the origin) we subtract the mean \(\v \mu\) from \(\v X\):

\[\v X_z = \v X - \v \mu\]

4.1.1. Maximizing Directional Variance

We have already seen that the directional variance in direction \(\v r\) is given by

\[\Var(\v r\T \v X_z) = \v r\T \Cov(\v X_z) \v r = \v r\T \Sigma \v r\]
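As a quick numerical check (a minimal NumPy sketch on hypothetical data, not part of the derivation), the sample variance of the projected data indeed equals the quadratic form \(\v r\T S \v r\) computed from the sample covariance matrix:

    import numpy as np

    rng = np.random.default_rng(0)

    # hypothetical correlated 2D data, for illustration only
    X = rng.multivariate_normal(mean=[1.0, 2.0],
                                cov=[[3.0, 1.0],
                                     [1.0, 2.0]], size=100_000)

    S = np.cov(X, rowvar=False)          # sample covariance matrix
    r = np.array([1.0, 1.0])
    r /= np.linalg.norm(r)               # unit length direction

    print(np.var(X @ r, ddof=1))         # sample variance of the projected data
    print(r @ S @ r)                     # r^T S r -- the same value (up to floating point)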

The direction of maximal variance (let’s call it \(\v u_1\)) is then

\[\v u_1 = \arg\max_{\v r: \|\v r\|=1} \v r\T \Sigma \v r\]

Constrained optimization can be done with the Lagrange multipliers method to incorporate the constraint in the function to be optimized:

\[\v u_1, \lambda_1 = \arg\max_{\v r, \lambda_1} \v r\T \Sigma \v r - \lambda_1 (\|\v r\|^2-1)\]

Setting the derivatives with respect to \(\v r\) and \(\lambda_1\) equal to zero, the conditions for a maximum are given by

\[\begin{split}\Sigma \v u_1 - \lambda_1 \v u_1 &= 0\\ \v u_1\T\v u_1 - 1 &= 0\end{split}\]

The second equation is of course the condition we started with. The first equation is the important one:

\[\Sigma \v u_1 = \lambda_1 \v u_1\]

showing that \(\v u_1\) is an eigenvector of \(\Sigma\) with eigenvalue \(\lambda_1\). The directional variance then is \(\v u_1\T \Sigma \v u_1 = \lambda_1\) (remember that the constraint forces \(\v u_1\) to have unit length). As we are looking for the maximal variance, \(\v u_1\) should be the eigenvector corresponding to the largest eigenvalue \(\lambda_1\).

Next we could look for another direction, say \(\v u_2\), that is perpendicular to \(\v u_1\) and, under that extra constraint, maximizes the variance. As you may have guessed, \(\v u_2\) is the eigenvector with the second largest eigenvalue. The same reasoning applies to all the eigenvectors.
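The sketch below (with a hypothetical covariance matrix, assumed purely for illustration) checks this numerically: no random unit direction achieves a larger directional variance than the eigenvector belonging to the largest eigenvalue.

    import numpy as np

    rng = np.random.default_rng(1)

    # hypothetical covariance matrix, assumed for illustration only
    Sigma = np.array([[3.0, 1.0, 0.5],
                      [1.0, 2.0, 0.3],
                      [0.5, 0.3, 1.0]])

    # np.linalg.eigh returns the eigenvalues of a symmetric matrix in ascending order
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    u1 = eigvecs[:, -1]                     # eigenvector belonging to the largest eigenvalue

    print(u1 @ Sigma @ u1, eigvals[-1])     # equal: the maximal directional variance

    # no random unit direction does better
    r = rng.normal(size=(100_000, 3))
    r /= np.linalg.norm(r, axis=1, keepdims=True)
    directional_vars = np.einsum('ij,jk,ik->i', r, Sigma, r)
    print(directional_vars.max() <= eigvals[-1] + 1e-12)    # True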

4.1.2. A Linear Algebra View

Principal component analysis transforms the zero mean data in such a way that \(\v Y = A \v X_z\) has a diagonal covariance matrix. In that case the elements of \(\v Y\) are uncorrelated. It is not too difficult to prove that the choice \(A=U\T\), where \(U\) is the matrix of eigenvectors of \(\Sigma\), does the job: \(U\T\v X_z\) then has the diagonal covariance matrix \(D\).

Proof: The eigendecomposition of \(\Sigma\) is

\[\Sigma = U D U\T\]

where the columns of \(U\) are the eigenvectors and the diagonal elements of the diagonal matrix \(D\) are the eigenvalues of \(\Sigma\). Because for a real symmetric matrix the eigenvectors form an orthonormal basis (i.e. \(U\T U = U U\T = I\)) we can write:

\[D = U\T \Sigma U\]

Remember that

\[\Cov(A\v X) = A\Cov(\v X)A\T = A\Sigma A\T\]

Thus we see that \(\v Y=U\T \v X_z\) leads to \(\Cov(\v Y) = D\). \(\blacksquare\)
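The proof is easy to verify numerically. The following minimal sketch (on hypothetical data) estimates \(\Sigma\) from a sample, transforms with \(U\T\), and checks that the covariance matrix of the transformed data is diagonal:

    import numpy as np

    rng = np.random.default_rng(2)

    # hypothetical correlated sample, for illustration only
    Sigma_true = np.array([[3.0, 1.0, 0.5],
                           [1.0, 2.0, 0.3],
                           [0.5, 0.3, 1.0]])
    X = rng.multivariate_normal([1.0, 2.0, 3.0], Sigma_true, size=50_000)
    Xz = X - X.mean(axis=0)                  # zero mean data

    S = np.cov(Xz, rowvar=False)             # estimate of Sigma
    eigvals, U = np.linalg.eigh(S)

    Y = Xz @ U                               # every row is U^T x_z for one sample vector
    print(np.allclose(np.cov(Y, rowvar=False), np.diag(eigvals)))   # True: Cov(Y) = D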

Thus far we have not done any dimensionality reduction whatsoever. The random vector \(\v Y\) has the same dimension as the original vector \(\v X_z\). But in case we assume that the eigenvalues in \(D\) are in decreasing order, the last elements of \(\v Y\) are the coordinates in the directions of lowest variance. It is in the eigenvector basis that we may decide to leave out the basis vectors (eigenvectors) with low variance. We use only the first \(k\) eigenvectors:

\[\begin{split}\v Y_k = \begin{pmatrix}Y_1\\Y_2\\\vdots\\Y_k\end{pmatrix}\end{split}\]

We can calculate \(\v Y_k\) as

\[\v Y_k = U_k\T \v X_z\]

where \(U_k\) is the matrix with only the first \(k\) columns (eigenvectors) of \(U\).

The elements of the vector \(\v Y_k\) are the coordinates of the projected vector \(\v X_z\) with respect to the (reduced) eigenvector basis. Note that the number of elements in \(\v Y_k\) is less than the number of elements in \(\v X\) (i.e. \(k<n\)) and we thus have reduced the number of real values needed to represent the vector \(\v X\). That representation is only approximate though.

With respect to the original basis we have

\[\begin{split}\v X_z & \approx U_k \v Y_k\\ & \approx U_k U_k\T \v X_z\end{split}\]

Note that in case \(k=n\), i.e. \(U_k = U\), we have equality in the above expression.

Summarizing:

\[\v Y_k = U_k\T (\v X - E(\v X))\]

is the coordinate vector of \(\v X - E(\v X)\) with respect to the eigenvector basis and

\[\begin{split}\v X & \approx U_k \v Y_k + E(\v X)\\ & \approx U_k U_k\T (\v X - E(\v X)) + E(\v X)\end{split}\]
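To get a feel for the quality of the approximation, the sketch below (hypothetical 3D data, with the expectation and covariance matrix replaced by sample estimates as in the next subsection) projects onto the first \(k\) eigenvectors and reconstructs:

    import numpy as np

    rng = np.random.default_rng(3)

    # hypothetical data concentrated near a 2D plane in 3D space (illustration only)
    latent = rng.normal(size=(10_000, 2)) * [3.0, 1.0]
    A = np.array([[1.0, 0.0],
                  [0.5, 1.0],
                  [0.2, -0.3]])
    X = latent @ A.T + [1.0, 2.0, 3.0] + 0.05 * rng.normal(size=(10_000, 3))

    x_bar = X.mean(axis=0)
    eigvals, U = np.linalg.eigh(np.cov(X, rowvar=False))
    U = U[:, ::-1]                            # columns in order of decreasing eigenvalue

    for k in (1, 2, 3):
        U_k = U[:, :k]
        Y_k = (X - x_bar) @ U_k               # coordinates w.r.t. the reduced eigenvector basis
        X_rec = Y_k @ U_k.T + x_bar           # approximate reconstruction in the original basis
        err = np.mean(np.sum((X - X_rec) ** 2, axis=1))
        print(k, err)                         # the error drops sharply once k reaches 2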

4.1.3. PCA in Practice

In practice PCA is very often used for dimensionality reduction. However, neither the expectation nor the covariance matrix is known; all we have is a sample of \(\v X\): \(\v x\ls 1,\ldots,\v x\ls m\).

Given this sample we may estimate the mean and covariance matrix:

\[\widehat{E(\v X)} = \bar{\v x} = \frac{1}{m} \sum_{i=1}^{m} \v x\ls i\]

and:

\[\widehat{\Cov(\v X)} = S = \frac{1}{m} \sum_{i=1}^{m} (\v x\ls i - \bar{\v x})(\v x\ls i - \bar{\v x})\T\]

With these two the basic formulas for PCA are:

\[\begin{split}S &= U D U\T\\ \v y_k &= U_k\T (\v x - \bar{\v x})\\ \v x &\approx U_k \v y_k + \bar{\v x}\end{split}\]
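These three formulas translate almost directly into code. The following is a minimal NumPy sketch (the function names and example data are my own, chosen for illustration, not a standard API):

    import numpy as np

    def pca_fit(X, k):
        """Estimate the sample mean and the first k eigenvectors from the rows of X."""
        x_bar = X.mean(axis=0)
        S = np.cov(X, rowvar=False)
        eigvals, U = np.linalg.eigh(S)              # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1]           # indices for decreasing order
        return x_bar, U[:, order[:k]]

    def pca_transform(X, x_bar, U_k):
        """y_k = U_k^T (x - x_bar), applied to every row x of X."""
        return (X - x_bar) @ U_k

    def pca_reconstruct(Y_k, x_bar, U_k):
        """x ~ U_k y_k + x_bar, applied to every row y_k of Y_k."""
        return Y_k @ U_k.T + x_bar

    # usage on hypothetical data
    rng = np.random.default_rng(4)
    X = rng.multivariate_normal(np.zeros(4), np.diag([5.0, 2.0, 0.5, 0.1]), size=1_000)
    x_bar, U_k = pca_fit(X, k=2)
    Y_k = pca_transform(X, x_bar, U_k)
    X_approx = pca_reconstruct(Y_k, x_bar, U_k)
    print(X.shape, Y_k.shape, X_approx.shape)       # (1000, 4) (1000, 2) (1000, 4)

Note that np.cov divides by \(m-1\) whereas the estimator above divides by \(m\); this scales the eigenvalues by a constant factor but leaves the eigenvectors, and hence \(\v y_k\) and the reconstruction, unchanged.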

Please note that in case PCA is used for dimensionality reduction as a means of compressing the data, you must take into account that \(U_k\) and \(\bar{\v x}\) are needed as well to approximately reconstruct \(\v x\) from its reduced representation \(\v y_k\).
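As a rough indication of the bookkeeping (the sizes below are hypothetical): storing \(m\) vectors of dimension \(n\) directly costs \(mn\) numbers, whereas the reduced representation costs \(mk\) numbers for the \(\v y_k\)’s plus \(nk\) for \(U_k\) and \(n\) for \(\bar{\v x}\):

    # hypothetical sizes, for illustration only
    m, n, k = 10_000, 100, 10

    original = m * n              # all vectors x stored directly
    reduced = m * k + n * k + n   # all y_k, plus U_k, plus the mean
    print(original, reduced)      # 1000000 101100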