K-MEANS CLUSTERING

During my time working through a computer vision related course, we were developing an application capable of reconstructing three dimensional voxel clouds of people through analysis of video recordings. In addition to reconstruction of the subjects' body shape, we were also interested in deriving to which person individual voxels belong.

Assuming that each subject is wearing clothes of roughly uniform color, the idea was to perform clustering over voxel colors and in that way achieve a reasonable classification of voxels onto persons. This project is about one such clustering technique: the \(k\)-means algorithm.

Table of contents

About clustering

The objective of clustering methods is to divide a set of observations \(X\) among a set of categories \(C\). More formally, an observation \(\textbf{v} \in X\) is a \(d\)-dimensional feature vector, and we are looking to derive a mapping \(X \rightarrow C\).

\(k\)-Means

The \(k\)-means algorithm is one of if not the most popular method of achieving such a mapping. It consists of four steps, summarized below:

  1. Initialize with a guess of a mean feature vector representing each cluster (category). This is our best guess as to what a typical observation of that category looks like.
  2. Assign each observation to the cluster whose mean it is closest to.
  3. Update the means by taking all observations assigned to each cluster and computing the new mean value for each.
  4. Repeat steps 2 and 3 until there is no more change in the assignment.

Figure 1 depicts a progression of \(k\)-means on sample data:

Figure 1: Progression of \(k\)-means on some randomly generated data. The larger discs represent cluster means.

Implementation

The algorithm was implemented as a C++ header-only library. The library makes extensive use of templates in order to achieve better performance for large datasets.

We also include several methods for guessing the initial means, as well as alternative distance-measures (although the Euclidean distance is the most commonly used, sometimes others are preferred depending on the application).

Source code

You can view the source code and documentation on GitHub. The python scripts used to generate the plots on this page are under the demo directory of the repository.

See also