Principal component analysis (PCA)

Principal component analysis calculates the eigenvectors of the covariance matrix and orders them by their eigenvalues. The eigenvector with the largest eigenvalue represents the first principal component, onto which the highest variance in the data is mapped. Consequently, PCA can be used to reduce the dimensionality of the data by collecting the majority of the variance information in new artificial dimensions that are linear combinations of the original dimensions of the data. (As with the calculation of the robust distance, robust estimates of the covariance matrix can also be used for PCA.)
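A minimal sketch of this procedure, assuming the data is given as a NumPy array X with observations in rows and variables in columns, might look as follows:

import numpy as np

def pca(X):
    # Center the data so the covariance matrix describes variation around the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the variables (columns).
    cov = np.cov(X_centered, rowvar=False)
    # Eigendecomposition; eigh is used because the covariance matrix is symmetric.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Order the eigenvectors by decreasing eigenvalue: the first column is then
    # the first principal component, carrying the largest share of the variance.
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]

Projecting the centered data onto the first column of the returned eigenvector matrix yields the first-principal-component scores.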

In the following example the four measurement dimensions of the Iris dataset are mapped onto the first principal component. This new artificial dimension, calculated from a classical estimate of the covariance matrix, describes 92.5 percent of the variance in the data, which can also be interpreted as 92.5 percent of the information in the data.
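One way to reproduce this projection is with scikit-learn's PCA and its bundled copy of the Iris dataset; the exact variance figure reported may differ slightly from the one above depending on the covariance estimate used:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=1)               # keep only the first principal component
scores = pca.fit_transform(iris.data)   # projection of each data item
print(pca.explained_variance_ratio_)    # fraction of the total variance, roughly 0.92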

The following visualization shows the IDs of the data items, their projection onto the first principal component and their species. The data points are coloured according to their species.
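A plot of this kind could be produced, for example, with matplotlib, continuing from the scores and iris objects of the previous sketch (the colour mapping is an assumption and need not match the picture below exactly):

import matplotlib.pyplot as plt

ids = range(len(scores))
plt.scatter(ids, scores[:, 0], c=iris.target, cmap="brg")
plt.xlabel("ID of the data item")
plt.ylabel("Projection on the first principal component")
plt.title("First principal component of the Iris dataset")
plt.show()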

Picture 8: Illustration of the first principal component of the Iris dataset

The picture shows that the projection of the four measurement dimensions onto the first principal component clearly separates the blue species from the other two. The two remaining species show overlapping values, and one outlier of the red species lies well within the value range of the green species.