Label Clustering (k-means)#

The Label Clustering task performs a clustering process, according to the k-means approach, after having aggregated and filtered data according to a subset of label variables.

Label clustering can also be employed for the solution of signal prediction problems, where the behavior of a mono-dimensional output signal (described by the profile variables) has to be predicted starting from a set of label attributes.

The output of the task is a collection of clusters characterized by:

  • a (positive integer) index,

  • a central vector (centroid), and

  • a dispersion value measuring the normalized average distance of cluster members from the centroid. Each cluster is associated with a combination of projected clusters, one for each label variable.

The Label task is divided into four tabs:


The Options tab#

The Options tab is divided into two tabs: the Basic tab and the Advanced tab.

More information on this tab and on its features are available in the clustering tasks homepage.


The Monitor tab#

Within this tab, users can view the properties of the new clusters.

The Monitor tab is divided into two tabs: Elements tab and the Dispersion tab.

More information on this tab and on its features are available in the clustering tasks homepage.


The Clusters tab#

The Clusters tab contains a spreadsheet displaying the values of the profile attributes for the centroids of the generated clusters, along with the number of elements and the dispersion coefficient for each of them.

More information on this tab and on its features are available in the clustering tasks homepage.


The Results tab#

In the Results tab, users can visualize a summary of the results.

This tab is divided into two panes:

  • General info, whose features can be found in the clustering tasks homepage.

  • Result quantities, where users can visualize the following results quantities:

    • Average dispersion

    • Average dispersion of clusters

    • Average weight

    • Average weight of single samples

    • Davies-Bouldin index

    • Dispersion of default cluster

    • Inter-cluster distance variance

    • Intra-cluster distance variance

    • Maximum dispersion

    • Maximum number of points in a cluster

    • Minimum dispersion

    • Minimum number of points in a cluster

    • Number of clusters

    • Number of distinct samples

    • Number of samples

    • Number of single samples

    • Number of singleton clusters

    • PseudoF index (Calinski-Harabasz criterion)

    • Relative PseudoF index


Example#

The following example uses the Adult dataset.

  • After having imported the selected dataset through an Import from Text file task, drag a Data Manager onto the stage and connect it to the import task.

  • In the Data Manager, right-click on the Income attribute, then select Ignored> Set.

  • Save and compute the task.

https://cdn.rulex.ai/docs/Factory/label-clustering-example-1.webp
  • Drag a Split Data task onto the stage to randomly split the dataset into two subsets (test and training set; 30% test and 70% training) and connect it to the Data Manager.

  • Save and compute the task.

https://cdn.rulex.ai/docs/Factory/label-clustering-example-2.webp
  • Then, drag a Label Clustering (K-means) task onto the stage and link it to the Split Data task and configure it as follows:

    • Attributes to consider for clustering: age, education-num, capital-gain, hours-per-week.

    • Label attributes: workclass, occupation.

    • Normalization for ordered variables: Normal.

    • Number of clusters to be generated: 2.

    • Uncheck the Filter patterns before clustering option in the Advanced tab.

  • Save and compute the task.

https://cdn.rulex.ai/docs/Factory/label-clustering-example-3.webp https://cdn.rulex.ai/docs/Factory/label-clustering-example-4.webp
  • The properties and characteristics of the new generated clusters can be visualized in the Monitor tab and in the Clusters tab.

https://cdn.rulex.ai/docs/Factory/label-clustering-example-5.webp
  • To view information concerning the representative records created by the Label Clustering (K-means) task, right-click on it, then select Take a look.

  • For each sample (row) of the training set we have the following result variables (columns)

    • assign: the initial assignment required by the user for the corresponding representative record; if no variables with role cluster id is included in the original dataset or its assignment is ambiguous (different cluster number for patterns with the same tag), the value in this column is set to missing.

    • weight: the weight of the associated representative record, computed as the sum of the weights associated with the patterns of the training set which have that tag.

    • nocc: the number of patterns in the training set which have the same tag as the considered sample (row).

    • disp: the dispersion coefficient obtained when computing the values of the profile attributes for the representative record.

    • sample: since several patterns may exist, which have the same tag as the current sample, the first occurrence of that tag in the training set is marked by setting the value 1 in the column; all the other occurrences assume the value 0 for this attribute.

    • selected: samples passing the data filtering phase assume the value 1 in this column; if no filtering is performed (by clearing the corresponding option in the Label Clustering (K-means) task) this variable assumes the value 1 in all patterns.

https://cdn.rulex.ai/docs/Factory/label-clustering-example-6.webp
  • Executing the Apply Model task creates the index of the cluster to which each pattern in the training and in the test set belongs. This is obtained by finding the cluster that includes the same tag as the considered sample; the default cluster is selected if that tag was not considered in the clustering process (i.e. it was not present in the training set).

https://cdn.rulex.ai/docs/Factory/label-clustering-example-7.webp
  • 14 additional result attributes have been added to the dataset as can be seen by right-clicking the computed task and selecting Take a look. The first three result attributes concern the cluster associated with the current pattern:

    • pred(Output): the index of the cluster.

    • conf(Output): the confidence of the association between cluster and pattern, given by 1-0.5*d1/d2, where d1 and d2 are the distances from the nearest and the second-nearest centroid, respectively. Since d1<d2 the confidence always lies in the interval [0.5,1].

    • clust(Output): the row of the Clusters tab in the Label Clustering task containing the associated cluster.

The subsequent five result variables report the values of the profile attributes for the centroid of the associated cluster: pred(age), pred(education-num), etc.

The remaining six result variables concern the error generated when these values are employed as a forecast for the actual profile attributes of the pattern. In particular, the first of these result variables (error) provides the total error, whereas the others (err(age), err(education-name) etc.) provide the error for each attribute.

https://cdn.rulex.ai/docs/Factory/label-clustering-example-8.webp