Saturday, November 29, 2014

Dimension Reduction using spectral decomposition (0) - PCA


Introduction

I survey most of the dimension reduction methods based on spectral decomposition, such as PCA (Principal Component Analysis), LDA (Linear Discriminant Analysis), FLD (Fisher Linear Discriminant), MDA (Multivariate Discriminant Analysis), and CCA (Canonical Correlation Analysis). The descriptions follow chronological order; each time we move to the next method, the problem setting becomes more complicated. There are many articles about these methods, but I could not find a report that focuses on the relationships between them. I want to understand why our predecessors developed each of them, from PCA to CCA and beyond. My final goal is to set up a unified table of all methods in the same mathematical notation for easier understanding.



Line Fitting on Sample Points

Before we start to discuss PCA, let's review the typical line fitting problem using the sum of squared errors.
Many basic machine learning methods are based on minimizing a sum of squared errors.



Let's look at a regression line $\mathbf{x}_{0}+a\mathbf{e}$ (where $\left\Vert \mathbf{e}\right\Vert =1$) fitted to the samples $\mathbf{x}_{n}$, $n=1,\dots,N$.

The objective function $f$ to minimize is: \begin{equation} f(\mathbf{x}_{0},\mathbf{e},a_{1},\dots,a_{N})=\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-(\mathbf{x}_{0}+a_{n}\mathbf{e})\right\Vert ^{2} \end{equation} where $\mathbf{x}_{0}+a_{n}\mathbf{e}$ is the closest point to $\mathbf{x}_{n}$ on the regression line.

This sum of squared errors (SSE) is only one among several candidate cost functions; the L1 norm or other norms could serve as well. If a specific error cost is prescribed for a problem, we should use it instead of SSE. The reason most standard methods use SSE is that it is easy to handle: it is differentiable, so its optimal parameters can often be written in closed form. (cf. logistic regression, where we must resort to numerical optimization.)
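To make the objective concrete, here is a minimal NumPy sketch (not part of the derivation) that evaluates Eq. (1) for a candidate line. The names X, x0, e, and a are my own, and the data layout (N samples by d dimensions) is an assumption.

```python
import numpy as np

def line_sse(X, x0, e, a):
    """Eq. (1): sum over n of ||x_n - (x_0 + a_n e)||^2.

    X : (N, d) samples, x0 : (d,) point on the line,
    e : (d,) unit direction, a : (N,) coefficients.
    """
    residuals = X - (x0 + np.outer(a, e))   # x_n - (x_0 + a_n e) for every n
    return np.sum(residuals ** 2)
```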

Next, I will check the relationship between the regression line and $\mu$, the mean of the $\mathbf{x}_{n}$.

First, take the derivative with respect to $a_{n}$, expressed in terms of $\mathbf{x}_{0},\mathbf{x}_{n},\mathbf{e}$. \begin{equation} f(\mathbf{x}_{0},\mathbf{e},a_{1},\dots,a_{N})=\sum_{n}^{N}\left\Vert a_{n}\mathbf{e}+\mathbf{x}_{0}-\mathbf{x}_{n}\right\Vert ^{2} \end{equation} \begin{eqnarray*} \frac{\partial f}{\partial a_{n}} & = & \frac{\partial\left(\sum_{n}^{N}\left\Vert a_{n}\mathbf{e}+\mathbf{x}_{0}-\mathbf{x}_{n}\right\Vert ^{2}\right)}{\partial a_{n}}=\frac{\partial\left(\left\Vert a_{n}\mathbf{e}+\mathbf{x}_{0}-\mathbf{x}_{n}\right\Vert ^{2}\right)}{\partial a_{n}}\\ & = & \frac{\partial\left(a_{n}^{2}\left\Vert \mathbf{e}\right\Vert ^{2}+2a_{n}\mathbf{e}^{T}(\mathbf{x}_{0}-\mathbf{x}_{n})+\left\Vert \mathbf{x}_{0}-\mathbf{x}_{n}\right\Vert ^{2}\right)}{\partial a_{n}}\\ & = & 2a_{n}+2\mathbf{e}^{T}(\mathbf{x}_{0}-\mathbf{x}_{n}) \end{eqnarray*}

We can get $\underset{a_{n}}{\arg\min}(f)$ by setting ${\displaystyle \frac{\partial f}{\partial a_{n}}=0}$: \begin{equation} a_{n}=\mathbf{e}^{T}(\mathbf{x}_{n}-\mathbf{x}_{0}) \end{equation}

This is the projection of $(\mathbf{x}_{n}-\mathbf{x}_{0})$ onto $\mathbf{e}$. Most books skip this proof, since it is basic linear algebra.
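A quick numerical sanity check of Eq. (3) on made-up data (the names X, x0, e are hypothetical): perturbing the projection coefficients in any way can only increase the cost.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                    # N = 100 samples in 2D
x0 = np.array([0.5, -1.0])
e = np.array([3.0, 4.0]) / 5.0                   # unit direction, ||e|| = 1

def sse(a):
    # Eq. (1) evaluated for given coefficients a
    return np.sum((X - (x0 + np.outer(a, e))) ** 2)

a_opt = (X - x0) @ e                             # Eq. (3): a_n = e^T (x_n - x_0)
assert sse(a_opt) < sse(a_opt + 0.1)             # shifting every a_n increases the SSE
assert sse(a_opt) < sse(a_opt * 0.9)             # shrinking them does too
```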

Now analyze $f$ with respect to $\mathbf{x}_{0}$.

When expanding $f$, I will keep writing $a_{n}$ instead of $\mathbf{e}^{T}(\mathbf{x}_{n}-\mathbf{x}_{0})$ for a while; it just makes the calculation easier.
\begin{eqnarray*} f(\mathbf{x}_{0},\mathbf{e}) & = & \sum_{n}^{N}\left\Vert \mathbf{x}_{0}-\mathbf{x}_{n}+a_{n}\mathbf{e}\right\Vert ^{2}\\ & = & \sum_{n}^{N}\left(\left\Vert \mathbf{x}_{0}-\mathbf{x}_{n}\right\Vert ^{2}+2(a_{n}\mathbf{e})^{T}(\mathbf{x}_{0}-\mathbf{x}_{n})+\left\Vert a_{n}\mathbf{e}\right\Vert ^{2}\right)\\ & = & \sum_{n}^{N}\left\Vert \mathbf{x}_{0}-\mathbf{x}_{n}\right\Vert ^{2}+2\sum_{n}^{N}a_{n}\overbrace{\left(\mathbf{e}^{T}(\mathbf{x}_{0}-\mathbf{x}_{n})\right)}^{=-a_{n}}+\sum_{n}^{N}a_{n}^{2}\overbrace{\left\Vert \mathbf{e}\right\Vert ^{2}}^{=1}\\ & = & \sum_{n}^{N}\left\Vert \mathbf{x}_{0}\right\Vert ^{2}-2\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{x}_{0}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}\right\Vert ^{2}-2\sum_{n}^{N}a_{n}^{2}+\sum_{n}^{N}a_{n}^{2}\\ & = & \sum_{n}^{N}\left\Vert \mathbf{x}_{0}\right\Vert ^{2}-2\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{x}_{0}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}\right\Vert ^{2}-\sum_{n}^{N}a_{n}^{2}\\ & = & \sum_{n}^{N}\left\Vert \mathbf{x}_{0}\right\Vert ^{2}-2\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{x}_{0}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}\right\Vert ^{2}-\sum_{n}^{N}(\mathbf{x}_{0}-\mathbf{x}_{n})^{T}\mathbf{e}\mathbf{e}^{T}(\mathbf{x}_{0}-\mathbf{x}_{n})\\ & = & N\left\Vert \mathbf{x}_{0}\right\Vert ^{2}-2\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{x}_{0}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}\right\Vert ^{2}-\sum_{n}^{N}\left(\mathbf{x}_{0}^{T}\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{0}-2\mathbf{x}_{n}^{T}\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{0}+\mathbf{x}_{n}^{T}\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{n}\right)\\ & = & N\left\Vert \mathbf{x}_{0}\right\Vert ^{2}-\sum_{n}^{N}\mathbf{x}_{0}^{T}\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{0}-2\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{x}_{0}+2\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{0}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}\right\Vert ^{2}-\sum_{n}^{N}\mathbf{x}_{n}^{T}\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{n} \end{eqnarray*}
Similarly to the case of $a_{n}$, compute ${\displaystyle \frac{\partial f}{\partial\mathbf{x}_{0}}}$ and set it to zero. \begin{equation} \frac{\partial f(\mathbf{x}_{0},\mathbf{e})}{\partial\mathbf{x}_{0}}=2N\mathbf{x}_{0}-2N\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{0}-2\sum_{n}^{N}\mathbf{x}_{n}+2\mathbf{e}\mathbf{e}^{T}\sum_{n}^{N}\mathbf{x}_{n} \end{equation} We do not need a closed-form expression for $\mathbf{x}_{0}$; considering the stationarity condition in implicit form is enough. \begin{eqnarray*} 0 & = & 2N\mathbf{x}_{0}-2N\mathbf{e}\mathbf{e}^{T}\mathbf{x}_{0}-2\sum_{n}^{N}\mathbf{x}_{n}+2\mathbf{e}\mathbf{e}^{T}\sum_{n}^{N}\mathbf{x}_{n}\\ & = & 2N\mathbf{x}_{0}-2\sum_{n}^{N}\mathbf{x}_{n}+2\mathbf{e}\mathbf{e}^{T}\sum_{n}^{N}(\mathbf{x}_{n}-\mathbf{x}_{0})\\ & = & 2N\mathbf{x}_{0}-2\sum_{n}^{N}\mathbf{x}_{n}+2\mathbf{e}\sum_{n}^{N}a_{n} \end{eqnarray*} \begin{equation} \frac{\partial f}{\partial\mathbf{x}_{0}}=0\Longleftrightarrow\frac{1}{N}\sum_{n}^{N}\mathbf{x}_{n}=\mu=\mathbf{x}_{0}+\mathbf{e}\left(\frac{1}{N}\sum_{n}^{N}a_{n}\right) \end{equation} Eq. (5) means that the mean must lie on the regression line $\mathbf{x}_{0}+a\mathbf{e}$, which is intuitively reasonable.
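Eq. (5) can also be checked numerically: any base point $\mathbf{x}_{0}$ whose line passes through $\mu$ satisfies the stationarity condition and reproduces $\mu$ when shifted by $\mathbf{e}$ times the mean of the optimal $a_{n}$. The toy data and names below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                 # N = 200 samples in 3D
mu = X.mean(axis=0)
e = np.array([1.0, 2.0, -2.0]) / 3.0          # unit direction
x0 = mu - 2.7 * e                             # some point on the line through mu

a = (X - x0) @ e                              # optimal a_n for this (x_0, e), Eq. (3)
assert np.allclose(mu, x0 + e * a.mean())     # Eq. (5): the mean lies on the line
```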


This treatment of line fitting is not fully general; it is tailored to make the relationship between PCA and regression easier to describe.


PCA (Principal Component Analysis)

I will explain PCA as an extension of line fitting.
The strategy of PCA is:

Analyze the covariance of the sample vectors and extract a new set of basis vectors that represents the samples in a more "discriminable" way. Then build a smaller basis set by ignoring the less discriminative basis vectors. This kind of process is called "dimension reduction" because the number of new basis vectors is smaller than the original dimension.

And "Discriminability" means larger variances.
See the following figure. After getting new basis vectors $\mathbf{e}_1$, $\mathbf{e}_2$, just ignore $\mathbf{e}_2$. Now sample data can be represented values projected on $\mathbf{e}_1$. This process reduces 2D to 1D

It is reasonable that $\mathbf{e}_1$ is more discriminative than $\mathbf{e}_2$, since $\mathbf{e}_1$ maximizes the new variance, i.e. the variance of the values projected onto it.
$\mathbf{e}_2$ is in turn the direction of largest variance within the subspace orthogonal to $\mathbf{e}_1$. In a reduction from 3D to 2D, $\mathbf{e}_2$ would also become a member of the new basis.

The new coordinates of the samples, i.e. the values $a$ obtained by projecting onto the new basis vectors, are called "principal components".
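To make the 2D-to-1D picture concrete, here is a small sketch on my own toy data (all names are made up) that takes $\mathbf{e}_1$ as the leading eigenvector of the sample covariance and keeps only the projections onto it; the justification for this choice is derived in the rest of this section.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[3.0, 0.0], [1.5, 0.5]])
X = rng.normal(size=(500, 2)) @ A              # correlated 2D toy samples
mu = X.mean(axis=0)

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
e1 = eigvecs[:, -1]                            # direction of largest variance
a = (X - mu) @ e1                              # 1D representation: principal components
print(eigvals[-1], a.var(ddof=1))              # variance along e1 equals the top eigenvalue
```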



Now I will explain the fundamentals of PCA step by step.
The following mathematical description is based on the famous textbook: Richard O. Duda et al., Pattern Classification, 2nd Ed.

Let's first think about a single new basis vector for the samples. This is exactly the line fitting problem explained in the previous section.
Rearrange the main formula of the previous section in terms of $\mathbf{e}$ by setting $\mathbf{x}_0=\mu$. I will explain later why $\mathbf{x}_0=\mu$ is justified, using the previous analysis; for now, just assume that $\mu$ is the best representative value of the samples' distribution.

\begin{eqnarray*} f(\mathbf{e}) & = & \sum_{n}^{N}\left\Vert a_{n}\mathbf{e}-(\mathbf{x}_{n}-\mu)\right\Vert ^{2}\\ & = & \sum_{n}^{N}\left(\left\Vert a_{n}\mathbf{e}\right\Vert ^{2}-2a_{n}\mathbf{e}^{T}(\mathbf{x}_{n}-\mu)+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & \sum_{n}^{N}\left(a_{n}^{2}-2a_{n}^{2}+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & \sum_{n}^{N}\left(-a_{n}^{2}+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & -\sum_{n}^{N}a_{n}^{2}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\\ & = & -\sum_{n}^{N}\mathbf{e}^{T}(\mathbf{x}_{n}-\mu)(\mathbf{x}_{n}-\mu)^{T}\mathbf{e}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\\ & = & -N\mathbf{e}^{T}\mathbf{\Sigma}\mathbf{e}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2} \end{eqnarray*}

where $\mathbf{\Sigma}=\frac{1}{N}\sum_{n}^{N}(\mathbf{x}_{n}-\mu)(\mathbf{x}_{n}-\mu)^{T}$ is the (biased) sample covariance matrix.
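This identity is easy to verify numerically: for any unit $\mathbf{e}$, the directly computed SSE matches $-N\mathbf{e}^{T}\mathbf{\Sigma}\mathbf{e}+\sum_{n}\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}$. The data and variable names in this sketch are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
N, d = X.shape
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / N              # Sigma = (1/N) sum_n (x_n - mu)(x_n - mu)^T

e = rng.normal(size=d)
e /= np.linalg.norm(e)                         # ||e|| = 1
a = (X - mu) @ e                               # a_n = e^T (x_n - mu)

f_direct = np.sum((X - (mu + np.outer(a, e))) ** 2)
f_identity = -N * e @ Sigma @ e + np.sum((X - mu) ** 2)
assert np.isclose(f_direct, f_identity)
```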

To minimize $f$, the constraint $\left\Vert \mathbf{e}\right\Vert =1$ must be taken into account. $f$ is turned into the Lagrangian $L$ with a Lagrange multiplier $\lambda$.

\begin{equation} L(\mathbf{e},\lambda)=-N\mathbf{e}^{T}\mathbf{\Sigma}\mathbf{e}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}+\lambda(\mathbf{e}^{T}\mathbf{e}-1) \end{equation} \begin{equation} \frac{\partial L(\mathbf{e},\lambda)}{\partial\mathbf{e}}=-2N\mathbf{\Sigma}\mathbf{e}+2\lambda\mathbf{e}=0\Longleftrightarrow N\mathbf{\Sigma}\mathbf{e}=\lambda\mathbf{e} \end{equation}

This is a typical eigenvalue problem, and $(\mathbf{e},\lambda)$ is an (eigenvector, eigenvalue) pair of $N\mathbf{\Sigma}$.

From this we can draw the following conclusions (a numerical check follows the list):

1. The $\mathbf{rank}$ of the covariance matrix equals the number of eigenvectors with nonzero eigenvalue, i.e. the number of usable basis directions.
2. Since $\sum_{n}^{N}a_{n}^{2}$ is $N$ times the variance of the $a_{n}$ and $\sum_{n}^{N}a_{n}^{2}=N\mathbf{e}^{T}\mathbf{\Sigma}\mathbf{e}=\mathbf{e}^{T}\lambda\mathbf{e}=\lambda$, a larger $\lambda$ means a larger variance of the projections.
3. Since $f=-N\mathbf{e}^{T}\mathbf{\Sigma}\mathbf{e}+\text{const}=-\lambda+\text{const}$, the $\mathbf{e}$ with the largest eigenvalue is the global minimizer of $f$. The other eigenvectors are merely stationary points of the constrained problem (the one with the smallest eigenvalue actually maximizes $f$).
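Here is the numerical check promised above, on made-up anisotropic data: every eigenpair $(\mathbf{e},\lambda)$ of $N\mathbf{\Sigma}$ satisfies $\lambda=\sum_{n}a_{n}^{2}$, and $f(\mathbf{e})$ equals the constant term minus $\lambda$, so the largest eigenvalue gives the smallest $f$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3)) @ np.diag([3.0, 1.0, 0.2])   # anisotropic toy samples
N, d = X.shape
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / N

lams, E = np.linalg.eigh(N * Sigma)            # eigenpairs of N*Sigma, ascending lambdas
const = np.sum((X - mu) ** 2)                  # sum_n ||x_n - mu||^2
for lam, e in zip(lams, E.T):
    a = (X - mu) @ e                           # projections onto this eigenvector
    f_e = np.sum((X - (mu + np.outer(a, e))) ** 2)
    assert np.isclose(np.sum(a ** 2), lam)     # point 2: lambda = sum_n a_n^2
    assert np.isclose(f_e, const - lam)        # point 3: f(e) = const - lambda
```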

Let's extend this analysis to the case of multiple basis vectors:
\begin{eqnarray*} f(\mathbf{e}_{1},\dots,\mathbf{e}_{D}) & = & \sum_{n}^{N}\left\Vert \sum_{d}^{D}a_{n,d}\mathbf{e}_{d}-(\mathbf{x}_{n}-\mu)\right\Vert ^{2}\\ & = & \sum_{n}^{N}\left(\left\Vert \sum_{d}^{D}a_{n,d}\mathbf{e}_{d}\right\Vert ^{2}-2\sum_{d}^{D}a_{n,d}\mathbf{e}_{d}^{T}(\mathbf{x}_{n}-\mu)+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & \sum_{n}^{N}\left(\sum_{d}^{D}a_{n,d}^{2}+2\sum_{d<d'}^{D}a_{n,d}a_{n,d'}\underbrace{\mathbf{e}_{d}^{T}\mathbf{e}_{d'}}_{=0}-2\sum_{d}^{D}a_{n,d}\mathbf{e}_{d}^{T}(\mathbf{x}_{n}-\mu)+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & \sum_{n}^{N}\left(\sum_{d}^{D}\left(a_{n,d}^{2}-2a_{n,d}^{2}\right)+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & \sum_{n}^{N}\left(-\sum_{d}^{D}a_{n,d}^{2}+\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2}\right)\\ & = & -N\sum_{d}^{D}\mathbf{e}_{d}^{T}\mathbf{\Sigma}\mathbf{e}_{d}+\sum_{n}^{N}\left\Vert \mathbf{x}_{n}-\mu\right\Vert ^{2} \end{eqnarray*}
where $D\leq\dim(\mathbf{x}_n)$ and the $\mathbf{e}_{d}$ are assumed to be orthonormal.
The process of obtaining each individual basis vector is the same as before.

\begin{equation} \frac{\partial L(\mathbf{e}_{d},\lambda)}{\partial\mathbf{e}_{d}}=-2N\mathbf{\Sigma}\mathbf{e}_{d}+2\lambda\mathbf{e}_{d}=0\Longleftrightarrow N\mathbf{\Sigma}\mathbf{e}_{d}=\lambda\mathbf{e}_{d} \end{equation}

Since the eigenvectors of a symmetric matrix can be chosen to be orthogonal, the eigenvectors of $N\mathbf{\Sigma}$ serve as the new basis vectors for PCA.

As mentioned before, a larger $\lambda$ makes $f$ smaller, so we choose the $D$ eigenvectors with the largest eigenvalues.
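Putting the pieces together, a minimal PCA sketch under these assumptions (biased sample covariance, orthonormal eigenvectors from numpy.linalg.eigh) might look like the following. The function name pca_reduce and the toy data are my own; in practice one would usually reach for a library implementation such as sklearn.decomposition.PCA instead.

```python
import numpy as np

def pca_reduce(X, D):
    """Return the mean, the top-D eigenvector basis E, and the principal components A."""
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / len(X)     # biased sample covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues
    E = eigvecs[:, ::-1][:, :D]                # D eigenvectors with the largest eigenvalues
    A = (X - mu) @ E                           # principal components a_{n,d}
    return mu, E, A

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))   # made-up 5D samples
mu, E, A = pca_reduce(X, D=2)
X_approx = mu + A @ E.T                        # reconstruction from 2 components
print(np.sum((X - X_approx) ** 2))             # residual SSE = N times the sum of discarded eigenvalues
```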



Let's go back to why $\mathbf{x}_0$ should be $\mu$. As explained in the previous section, $\mu$ lies on the line along $\mathbf{e}_1$ (the basis vector with the largest eigenvalue). The same condition holds for every $\mathbf{e}_d$:

\begin{equation} \mu=\mathbf{x}_{0}+\frac{1}{N}\sum_{n}^{N}a_{n,d}\mathbf{e}_{d} \end{equation}
$\mu$ must lie on the line through $\mathbf{x}_{0}$ along every $\mathbf{e}_{d}$. Since each of these lines passes through both $\mathbf{x}_{0}$ and $\mu$, but two lines with different directions can share at most one point, for $D>1$ the base point $\mathbf{x}_{0}$ must coincide with $\mu$.

