Principal Component Analysis
============================

The underlying assumption in principal component analysis (PCA) is that the
data is 'concentrated' in an affine subspace of dimension $d$ of feature space
(of dimension $n$). We model our data with the random vector $\v X$ with

.. math:: \v\mu = E(\v X), \quad \Sigma = \Cov(\v X)

To turn this affine subspace into a linear subspace we subtract the mean
$\v \mu$ from $\v X$:

.. math:: \v X_z = \v X - \v \mu

Maximizing Directional Variance
-------------------------------

We have already seen that the directional variance in direction $\v r$ is
given by

.. math:: \Var(\v r\T \v X_z) = \v r\T \Cov(\v X_z) \v r = \v r\T \Sigma \v r

The direction of maximal variance (let's call it $\v u_1$) is then

.. math:: \v u_1 = \arg\max_{\v r: \|\v r\|=1} \v r\T \Sigma \v r

Constrained optimization can be done with the Lagrange multiplier method to
incorporate the constraint in the function to be optimized:

.. math:: \v u_1, \lambda_1 = \arg\max_{\v r, \lambda} \v r\T \Sigma \v r - \lambda (\|\v r\|^2-1)

The conditions for a maximum are then given by

.. math::
   \Sigma \v u_1 - \lambda_1 \v u_1 &= 0\\
   \v u_1\T\v u_1 - 1 &= 0

The second equation is of course the condition we started with. The first
equation is the important one:

.. math:: \Sigma \v u_1 = \lambda_1 \v u_1

showing that $\v u_1$ is an eigenvector of $\Sigma$ with eigenvalue
$\lambda_1$. The directional variance then is
$\v u_1\T \Sigma \v u_1 = \lambda_1 \v u_1\T \v u_1 = \lambda_1$ (remember
that $\v u_1$ is constrained to have unit length). As we are looking for the
maximal value, $\v u_1$ should be the eigenvector with the largest eigenvalue
$\lambda_1$.

Next we could look for another direction, say $\v u_2$, that is perpendicular
to $\v u_1$ and, under that extra constraint, maximizes the variance. You
guessed correctly: $\v u_2$ is the eigenvector with the second largest
eigenvalue. The same reasoning applies to all the eigenvectors.

..
   Next we need to find the direction, *perpendicular to $\v u_1$*, that has
   maximum variance:

   .. math:: \v u_2, \lambda_2, \mu_2 = \arg\max_{r,\lambda,\mu}

A Linear Algebra View
---------------------

Principal component analysis transforms the zero mean data in such a way that
$\v Y = A \v X_z$ has a diagonal covariance matrix. In that case the elements
of $\v Y$ are uncorrelated. It is not too difficult to prove that choosing
$A=U\T$, where $U$ is the matrix of eigenvectors of $\Sigma$, gives
$U\T\v X_z$ the diagonal covariance matrix $D$.

Proof: The eigendecomposition of $\Sigma$ is

.. math:: \Sigma = U D U\T

where the columns of $U$ are the eigenvectors and the diagonal elements of the
diagonal matrix $D$ are the eigenvalues of $\Sigma$. Because for a real
symmetric matrix the eigenvectors form an orthonormal basis, $U$ is an
orthogonal matrix ($U\T U = I$) and we can write:

.. math:: D = U\T \Sigma U

Remember that

.. math:: \Cov(A\v X) = A\Cov(\v X)A\T = A\Sigma A\T

Thus we see that $\v Y=U\T \v X_z$ leads to
$\Cov(\v Y) = U\T \Sigma U = D$. $\blacksquare$
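As a concrete check of this result, here is a minimal NumPy sketch (not part
of the original derivation; the synthetic data and all variable names are made
up for illustration). It estimates $\Sigma$ from samples, computes the
eigendecomposition, and verifies that $U\T \v X_z$ has an (approximately)
diagonal covariance matrix and that the first eigenvector is the direction of
maximal variance.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)

   # Synthetic data: 500 samples in n = 3 dimensions with correlated features.
   X = rng.multivariate_normal(
       mean=[1.0, -2.0, 0.5],
       cov=[[3.0, 1.2, 0.4],
            [1.2, 2.0, 0.3],
            [0.4, 0.3, 0.5]],
       size=500)

   # Zero-mean data X_z and its sample covariance matrix Sigma.
   mu = X.mean(axis=0)
   Xz = X - mu
   Sigma = np.cov(Xz, rowvar=False)

   # Eigendecomposition of the symmetric matrix Sigma = U D U^T.
   # eigh returns eigenvalues in ascending order; reorder to descending.
   eigvals, U = np.linalg.eigh(Sigma)
   order = np.argsort(eigvals)[::-1]
   eigvals, U = eigvals[order], U[:, order]

   # Y = U^T X_z has (approximately) diagonal covariance with the
   # eigenvalues on the diagonal: the components of Y are uncorrelated.
   Y = Xz @ U                     # each row is U^T x_z for one sample
   print(np.round(np.cov(Y, rowvar=False), 3))
   print(np.round(eigvals, 3))

   # The first eigenvector u_1 is the direction of maximal variance:
   # Var(u_1^T X_z) = lambda_1.
   u1 = U[:, 0]
   print(np.var(Xz @ u1, ddof=1), eigvals[0])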
Thus far we have not done any dimensionality reduction whatsoever. The random
vector $\v Y$ is of the same dimension as the original vector $\v X_z$. But in
case we assume that the eigenvalues in $D$ are in decreasing order, the last
elements of $\v Y$ are in the directions of lowest variance. It is in the
eigenvector basis that we may decide to leave out the basis vectors
(eigenvectors) with low variance. We use only the first $k$ eigenvectors:

.. math:: \v Y_k = \begin{pmatrix}Y_1\\Y_2\\\vdots\\Y_k\end{pmatrix}

We can calculate $\v Y_k$ as

.. math:: \v Y_k = U_k\T \v X_z

where $U_k$ is the matrix with only the first $k$ columns (eigenvectors) of
$U$. The elements of the vector $\v Y_k$ are the coordinates of the projected
vector $\v X_z$ with respect to the (reduced) eigenvector basis. Note that the
number of elements in $\v Y_k$ is less than the number of elements in $\v X$
(i.e. $k < n$).
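A minimal NumPy sketch of this projection (again with made-up data and
hypothetical variable names): the data is generated to lie close to a $d=2$
dimensional affine subspace of a $n=3$ dimensional feature space, so keeping
the first $k=2$ eigenvectors retains almost all of the variance.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(1)

   # Data that is nearly concentrated in a 2-D affine subspace of R^3.
   latent = rng.normal(size=(400, 2))
   A = np.array([[2.0, 0.3],
                 [0.5, 1.5],
                 [0.1, 0.2]])
   X = (latent @ A.T + np.array([1.0, -1.0, 3.0])
        + 0.05 * rng.normal(size=(400, 3)))

   # PCA: eigendecomposition of the sample covariance of the zero-mean data,
   # with eigenvalues (and eigenvectors) sorted in decreasing order.
   mu = X.mean(axis=0)
   Xz = X - mu
   eigvals, U = np.linalg.eigh(np.cov(Xz, rowvar=False))
   order = np.argsort(eigvals)[::-1]
   eigvals, U = eigvals[order], U[:, order]

   # Keep only the first k eigenvectors: Y_k = U_k^T X_z has k coordinates.
   k = 2
   Uk = U[:, :k]
   Yk = Xz @ Uk
   print(Yk.shape)                      # (400, 2), i.e. k < n

   # Mapping back to feature space recovers X up to the discarded
   # (low variance) directions, so little information is lost here.
   X_approx = Yk @ Uk.T + mu
   print(np.max(np.abs(X - X_approx)))  # small reconstruction error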