Principal component analysis (PCA) computes the eigenvectors of the covariance matrix and orders them
by their eigenvalues. The eigenvector with the largest eigenvalue is the first principal component,
the direction onto which the largest share of the variance in the data is mapped. Consequently, PCA can be used to reduce the dimensionality of
the data by collecting the majority of the variance in a few new artificial dimensions
that are linear combinations of the original dimensions. (As with the calculation of
the robust distance, robust estimates of the covariance matrix can also be used for PCA.)
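The procedure above (classic covariance estimate, eigendecomposition, ordering by eigenvalue, projection) can be sketched as follows; the toy data values are purely illustrative and not taken from this document:

```python
import numpy as np

# Toy data: 5 samples, 3 dimensions (illustrative values, not from the text)
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.9],
    [2.2, 2.9, 0.8],
    [1.9, 2.2, 0.6],
    [3.1, 3.0, 0.4],
])

# Classic (non-robust) covariance estimate of the mean-centered data
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigendecomposition; eigh is suitable because the covariance matrix is symmetric
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Order eigenvectors by descending eigenvalue: the first column is then
# the first principal component
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the data onto the first principal component
pc1_scores = Xc @ eigenvectors[:, 0]
```

A robust variant would replace `np.cov` with a robust covariance estimator (for example the minimum covariance determinant); the rest of the computation is unchanged.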
In the following example the four measurement dimensions of the Iris dataset are mapped onto the first principal
component. This new artificial dimension, calculated from a classic estimate of the covariance matrix, describes
92.5 percent of the variance in the data, which can also be interpreted as 92.5 percent of the information
in the data.
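The 92.5 percent figure can be reproduced as the share of the first eigenvalue in the sum of all eigenvalues. A minimal sketch, assuming scikit-learn is available to load the Iris data:

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data  # 150 samples, 4 measurement dimensions

# Classic covariance estimate and eigendecomposition, as above
Xc = X - X.mean(axis=0)
eigenvalues = np.linalg.eigh(np.cov(Xc, rowvar=False))[0]
eigenvalues = np.sort(eigenvalues)[::-1]

# Fraction of total variance captured by the first principal component
explained = eigenvalues[0] / eigenvalues.sum()
print(round(explained * 100, 1))  # close to the 92.5 percent quoted above
```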
The following visualization shows the IDs of the data items, their projection onto the
first principal component, and their species. The data is coloured according to species.
Picture 8: Illustration of the first principal component of the Iris dataset
The picture shows that the projection onto the first principal component clearly separates the blue species from the other two. The two remaining species overlap, and one outlier of the red species lies well within the value range of the green species.