Comparison between PCA and LDA

PCA and LDA are two popular dimensionality reduction methods, commonly used on data with many input features. The two algorithms are similar in many ways, yet fundamentally different in others. This article highlights some of the similarities and differences between them.

Let’s remind ourselves how these two algorithms work.
Principal Component Analysis, or PCA for short, is a commonly used unsupervised linear transformation technique that reduces the number of dimensions by finding the directions of maximum variance in high-dimensional data.
Linear Discriminant Analysis, or LDA for short, is a supervised method that takes class labels into account when reducing the number of dimensions. The goal of LDA is to find a feature subspace that optimizes class separability.
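
To make the supervised/unsupervised split concrete, here is a minimal sketch with scikit-learn; the Iris dataset is used purely as an example. Note that LDA’s fit_transform needs the class labels, while PCA’s does not:

```python
# Minimal sketch: same data, two transforms, one with and one without labels.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised -- only the feature matrix X is used.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised -- the class labels y guide the projection.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (150, 2) (150, 2)
```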

The algorithms’ learning mechanisms

While both rely on decomposing a matrix into its eigenvalues and eigenvectors, the biggest difference between the two lies in the basic learning approach: PCA is unsupervised, while LDA is supervised.
PCA reduces dimensions by examining the correlation between the different features and creating orthogonal axes - the principal components - that point in the directions of maximum variance and form the new subspace.
Essentially, PCA generates components along the directions in the data with the most variance - i.e. where the data is most spread out. These components are referred to both as principal components and as eigenvectors, and they span a subspace that retains most of the information - or variance - in our data. LDA, on the other hand, does almost the same, but it contains a “pre-processing” step that computes mean vectors from the class labels before extracting the eigenvalues.

Step by step, the two algorithms look like this:

LDA
1 - For each class label: compute the d-dimensional mean vector
2 - Construct a scatter matrix within the classes, and one between the classes.

This means that we first compute a mean vector for each label, so if we have three labels, we compute three mean vectors.
With these three mean vectors, we then compute a scatter matrix for each class and sum the three individual scatter matrices into one final matrix. We now have the within-class scatter matrix.
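
As a rough sketch in NumPy - assuming the data sits in an (n_samples, d) array X and the labels in an array y, names chosen here just for illustration - the within-class scatter matrix can be computed like this:

```python
import numpy as np

def within_class_scatter(X, y):
    """Sum of the per-class scatter matrices: a d x d matrix."""
    n_features = X.shape[1]
    S_W = np.zeros((n_features, n_features))
    for label in np.unique(y):
        X_c = X[y == label]            # samples belonging to this class
        mean_c = X_c.mean(axis=0)      # d-dimensional mean vector of the class
        diff = X_c - mean_c            # center the class samples
        S_W += diff.T @ diff           # scatter matrix of this class
    return S_W
```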

To generate the between-class scatter matrix, we take the overall mean of the original input data, subtract it from each class mean vector, and take the outer product of that difference with itself, weighted by the number of samples in the class. This is best expressed by the equation below, where m is the overall mean of the original input data, m_i is the mean vector of class i, and N_i is the number of samples in class i.
$$
S_B = \sum\limits_{i=1}^{c} N_{i} (\pmb m_i - \pmb m) (\pmb m_i - \pmb m)^T
$$
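
A matching sketch for the between-class scatter matrix, following the equation above and using the same assumed X and y:

```python
import numpy as np

def between_class_scatter(X, y):
    """S_B = sum_i N_i (m_i - m)(m_i - m)^T, with m the overall mean."""
    n_features = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_B = np.zeros((n_features, n_features))
    for label in np.unique(y):
        X_c = X[y == label]
        N_i = X_c.shape[0]                                          # samples in class i
        mean_diff = (X_c.mean(axis=0) - overall_mean).reshape(-1, 1)
        S_B += N_i * (mean_diff @ mean_diff.T)                      # weighted outer product
    return S_B
```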

PCA
1 - Construct the covariance matrix by taking the joint covariance - or correlation in some cases - between each pair of features.

Then, with the generated matrix, we carry out the steps below (a sketch putting them together follows the list):

1 - Compute the eigenvectors and eigenvalues of the matrix
2 - Sort the eigenvalues in decreasing order to rank the eigenvectors
3 - Select the k eigenvectors that correspond to the k largest eigenvalues
4 - Construct a projection matrix from the top k eigenvectors
5 - Transform the original input dataset with the newly created projection matrix
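
Putting these steps together, a minimal from-scratch sketch in NumPy could look like the following (X is again an assumed (n_samples, d) feature array, k the number of components to keep, and pca_project is just an illustrative name):

```python
import numpy as np

def pca_project(X, k):
    """Project X onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)            # center the data
    cov = np.cov(X_centered, rowvar=False)     # covariance matrix between the features
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigendecomposition of the symmetric matrix
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
    W = eigvecs[:, order[:k]]                  # projection matrix from the top-k eigenvectors
    return X_centered @ W                      # transform the original data
```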

This was a comparison between PCA and LDA, and as we can see they are very similar: both are linear transformation techniques that decompose a matrix into eigenvalues and eigenvectors. The main difference is that LDA takes class labels into account, whereas PCA is unsupervised and does not.

Dictionary:

  • Eigenvectors: vectors whose direction is left unchanged by the transformation; the transformation only scales them
  • Eigenvalues: the scalar factor by which the corresponding eigenvector is scaled (see the small check below)
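
A small numerical check of these two definitions - the matrix A below is just an arbitrary example:

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)

v = eigvecs[:, 0]                    # an eigenvector of A
lam = eigvals[0]                     # its eigenvalue
print(np.allclose(A @ v, lam * v))   # True: A only scales v, it does not rotate it
```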