k-means clustering

The k-means clustering algorithm chooses k data items at random as the initial cluster centers. Each data item is then assigned to the cluster with the nearest center. After this assignment, a new center is calculated for each cluster as the mean of all data items in that cluster. The assignment and the center calculation are repeated until the cluster centers no longer change significantly or the maximum number of iterations has been reached.
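The steps described above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library, not the implementation used by the software. The function name k_means, the parameters max_iterations, tolerance and seed, the Euclidean distance measure, and the handling of empty clusters are all assumptions made for the example.

```python
import math
import random


def nearest_center(item, centers):
    """Return the index of the center closest to item (Euclidean distance)."""
    return min(range(len(centers)), key=lambda c: math.dist(item, centers[c]))


def k_means(items, k, max_iterations=100, tolerance=1e-6, seed=None):
    """Cluster items (a list of equal-length numeric tuples) into k groups.

    Returns (centers, assignments, distances).
    """
    rng = random.Random(seed)
    # Choose k data items at random as the initial cluster centers.
    centers = [tuple(item) for item in rng.sample(items, k)]

    for _ in range(max_iterations):
        # Assignment step: each item goes to the cluster with the nearest center.
        assignments = [nearest_center(item, centers) for item in items]

        # Update step: each new center is the mean of the items in its cluster.
        new_centers = []
        for c in range(k):
            members = [item for item, a in zip(items, assignments) if a == c]
            if members:
                new_centers.append(
                    tuple(sum(column) / len(members) for column in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])

        # Stop once no center has moved by more than the tolerance.
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:
            break

    # Final assignment and distance of every item to its own cluster center.
    assignments = [nearest_center(item, centers) for item in items]
    distances = [math.dist(item, centers[a]) for item, a in zip(items, assignments)]
    return centers, assignments, distances


# Usage with two clearly separated groups of made-up points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignments, distances = k_means(points, k=2, seed=0)
```

The sketch returns both the cluster assignment and the distance of each item to its cluster center, since these are the two outputs that later steps typically work with.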

For the clustering the following parameters can be set:

As a result, the distances of the data items to their cluster centers and the cluster assignments can be stored. In the following example, the Iris dataset is clustered. This dataset contains four different measurements for three species of Iris flowers. The k-means clustering tries to separate the Iris flowers into three groups based on these four attributes.

Picture 3: Clustering result of the Iris dataset in parallel coordinates

This picture shows the four measurement dimensions and, as a fifth dimension, the species. The data items are coloured according to their cluster assignments. The red cluster perfectly identifies one species, while the blue and the green clusters share the data items of the two similar species.

Selecting data items with a large distance to their cluster centers identifies items that lie at the border of a cluster and may belong to a different group in the data.
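As a rough illustration of this idea, the following Python sketch picks out the items whose stored distance is at or above a cutoff. The function name select_border_items, the threshold parameter, and the sample distances are hypothetical and chosen only for this example. In the software itself the selection is made interactively.

```python
def select_border_items(distances, threshold):
    """Return the indices of items whose distance to their cluster center
    is at least threshold."""
    return [i for i, d in enumerate(distances) if d >= threshold]


# Made-up distances of five items to their cluster centers.
distances = [0.2, 1.7, 0.4, 2.3, 0.9]
border = select_border_items(distances, threshold=1.5)
print(border)  # [1, 3]
```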

Picture 4: Selecting data according to the distances to cluster centers

Picture 4 shows the four measurement dimensions again, with the distances of the items to their cluster centers as a fifth dimension. The green selection selects data items fully (100 %). The blue selection introduces a degree of interest (DOI) that linearly attenuates the selection from 100 % to 0 %.
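One way to picture such a linear attenuation is as a ramp between two values on the distance axis. The Python sketch below is an assumption about how a linear DOI could be computed, not a description of the software's internals. The function name degree_of_interest and the parameters full_at and zero_at are hypothetical.

```python
def degree_of_interest(value, full_at, zero_at):
    """Linear degree of interest between 0.0 (0 %) and 1.0 (100 %).

    The result is 1.0 at full_at, 0.0 at zero_at, linear in between,
    and clamped to [0.0, 1.0] outside that range.
    """
    if full_at == zero_at:
        raise ValueError("full_at and zero_at must differ")
    t = (value - zero_at) / (full_at - zero_at)
    return max(0.0, min(1.0, t))


# Items at distance 2.0 or more are fully selected; the selection fades
# out linearly down to distance 1.0 (both limits are made-up values).
print(degree_of_interest(2.5, full_at=2.0, zero_at=1.0))  # 1.0
print(degree_of_interest(1.5, full_at=2.0, zero_at=1.0))  # 0.5
print(degree_of_interest(0.5, full_at=2.0, zero_at=1.0))  # 0.0
```

A fully selected item, as with the green selection, corresponds to a DOI of 1.0 in this sketch.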