Label Clustering (k-means)

After data has been aggregated and filtered, Rulex Factory allows users to perform a k-means clustering process.

The Label Clustering task is divided into four tabs:

  • The Options Tab, where users can choose the attributes to work on and configure the clustering options.

  • The Monitor Tab, where users can view the properties of the new clusters.

  • The Clusters Tab, where users can visualize the groups created.

  • The Results Tab, where users can visualize the results.


The Options tab

The Options tab is divided into two tabs: the Basic tab and the Advanced tab.

Available Attributes

Within this section, users will find a list of all the dataset’s attributes. To search for a specific attribute, use the lens icon at the top right of the panel. Users can also sort the attributes by choosing one of the following criteria from the Order by drop-down list:

  • Attribute

  • Name

  • Type

  • Ignored

  • Role

Attributes Drop Area

The Attributes Drop Area contains the following panes:

  • The Attributes to consider for clustering pane, where users insert the attributes they want to use as input for the cluster computation. This can be done via a Manual List (users manually drag and drop the selected attributes onto the pane) or via a filtered list.

  • The Label attributes pane, where users can drag and drop the attributes that will be considered as labels in the clustering computation.

Customization Panes

  • Clustering type. Through a drop-down list, users can choose between three different approaches for computing cluster centroids:

    • k-means, where the mean is used to compute the cluster centroid.

    • k-medians, where the median is used to compute the cluster centroid.

    • k-medoids, where the point of the dataset closest to the mean is used as the cluster centroid.
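
The three centroid definitions above can be illustrated with a minimal sketch in plain Python. This is not Rulex code: the function names and the sample points are hypothetical, and the medoid follows the description given here (the dataset point closest to the mean), using squared Euclidean distance as an assumption.

```python
from statistics import mean, median

def mean_centroid(points):
    """Coordinate-wise mean of the cluster (k-means)."""
    return [mean(col) for col in zip(*points)]

def median_centroid(points):
    """Coordinate-wise median of the cluster (k-medians)."""
    return [median(col) for col in zip(*points)]

def medoid_centroid(points):
    """Dataset point closest to the mean (k-medoids, as described above)."""
    center = mean_centroid(points)

    def sq_dist(p):
        # Squared Euclidean distance to the mean (assumed metric).
        return sum((a - b) ** 2 for a, b in zip(p, center))

    return list(min(points, key=sq_dist))

# Hypothetical cluster with one outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (3.0, 2.0), (10.0, 8.0)]
print(mean_centroid(cluster))    # [4.0, 3.0]
print(median_centroid(cluster))  # [2.5, 1.5]
print(medoid_centroid(cluster))  # [3.0, 2.0]
```

In this example the outlier pulls the mean away from the bulk of the points, while the median and the medoid stay closer to them.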

  • Clustering algorithm. Through a drop-down list, users can choose between three different clustering algorithms:

    • Standard, where cluster centroids are recomputed only after all the points have been reassigned;

    • Incremental, where cluster centroids are recomputed after each point is moved;

    • Error-based, where the decision to move a point is made by minimizing the error, instead of the distance from the cluster centroid.
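
The difference between the Standard and Incremental options can be sketched as one pass over the data in each style. The code below is an illustrative assumption, not the Rulex implementation: the helper names are hypothetical, squared Euclidean distance is assumed, and the Error-based variant is not shown.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    return [sum(col) / len(col) for col in zip(*points)]

def nearest(p, centroids):
    return min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))

def standard_pass(points, centroids):
    """Reassign every point first, then recompute all centroids once."""
    labels = [nearest(p, centroids) for p in points]
    new_centroids = [list(c) for c in centroids]
    for k in range(len(centroids)):
        members = [p for p, l in zip(points, labels) if l == k]
        if members:  # keep the old centroid if a cluster becomes empty
            new_centroids[k] = centroid(members)
    return labels, new_centroids

def incremental_pass(points, labels, centroids):
    """Recompute the two affected centroids right after each move."""
    labels = list(labels)
    centroids = [list(c) for c in centroids]
    for i, p in enumerate(points):
        k = nearest(p, centroids)
        if k != labels[i]:
            old, labels[i] = labels[i], k
            for c in (old, k):
                members = [q for q, l in zip(points, labels) if l == c]
                if members:
                    centroids[c] = centroid(members)
    return labels, centroids

# Hypothetical 1-D data with a poor initial assignment.
points = [(0.0,), (1.0,), (4.0,), (5.0,)]
start_labels = [0, 0, 0, 1]
start_centroids = [[5.0 / 3.0], [5.0]]
print(standard_pass(points, start_centroids))
print(incremental_pass(points, start_labels, start_centroids))
```

On this small dataset both passes end with the same assignment; on larger datasets the incremental variant can follow a different path because later points see centroids that have already been updated.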

  • Distance method for clustering. Users can choose the method employed for computing distances between examples. Available methods are:

    • Euclidean

    • Euclidean (normalized)

    • Manhattan

    • Manhattan (normalized)

    • Pearson

  • Distance method for evaluation. Users can choose the method employed for computing distances during evaluation. Available methods are:

    • Euclidean

    • Euclidean (normalized)

    • Manhattan

    • Manhattan (normalized)

    • Pearson
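
As a rough illustration of the distance methods listed above, the sketch below implements the unnormalized Euclidean and Manhattan distances and a Pearson-based distance. The Pearson distance is taken here as one minus the correlation coefficient, which is a common convention but an assumption about how the software defines it; the normalized variants are not shown because their exact scaling is not documented here.

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def pearson_distance(p, q):
    """1 - Pearson correlation between the two vectors (assumed convention)."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    if sp == 0 or sq == 0:
        raise ValueError("Pearson correlation is undefined for constant vectors")
    return 1.0 - cov / (sp * sq)

print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(manhattan((0.0, 0.0), (3.0, 4.0)))  # 7.0
# Perfectly correlated vectors have a Pearson distance of (almost exactly) zero.
print(pearson_distance((1.0, 2.0, 3.0), (2.0, 4.0, 6.0)))
```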

  • Normalization for ordered variables. The type of normalization to use when treating ordered variables. Available options are:

    • None

    • Attribute

    • Normal

    • Minmax [0,1]

    • Minmax [-1,1]
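
To give an idea of what these normalizations do to an ordered attribute, here is a minimal sketch of the two min-max ranges and of a z-score. Treating the Normal option as a z-score with the population standard deviation is an assumption, the Attribute option is not covered, and all function names are hypothetical.

```python
from statistics import mean, pstdev

def minmax_01(values):
    """Rescale values linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]

def minmax_pm1(values):
    """Rescale values linearly so that min -> -1 and max -> 1."""
    return [2.0 * v - 1.0 for v in minmax_01(values)]

def zscore(values):
    """Subtract the mean and divide by the standard deviation (assumed 'Normal')."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

column = [10.0, 20.0, 30.0]  # hypothetical attribute values
print(minmax_01(column))   # [0.0, 0.5, 1.0]
print(minmax_pm1(column))  # [-1.0, 0.0, 1.0]
print(zscore(column))
```

Normalizing matters for distance-based clustering because attributes with large numeric ranges would otherwise dominate the distance computation.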

  • Initial assignment for clusters. Procedure adopted for the initial assignment of points to clusters. Available options are:

    • Random

    • Smart

    • Weight-based

  • Number of clusters to be generated. The required number of clusters. The number of clusters cannot exceed the number of different examples in the training set.

  • Attribute for initial cluster assignment. Users can optionally select a specific attribute from the drop-down list, which will be used as an initial cluster assignment.

  • Attribute for weights. Users can optionally select an attribute from the drop-down list, which will be used as a weight in the clustering process.

  • In the Advanced tab, users will find several advanced customization options. For more information, refer to the introductory Clustering tasks page.


The Monitor tab

Within this tab, users can view the properties of the new clusters.

The Monitor tab is itself divided into two tabs:

Elements

In the Elements tab, users can view a histogram displaying the number of elements.

  • On the X axis, users can visualize the Range.

  • On the Y axis, users will visualize the counts.

Hovering over a bar in the histogram displays information such as Count, Range, and Percentage on tot.

Dispersion

Within the Dispersion tab, the dispersion coefficient can be visualized.

  • On the X axis, users can visualize the Range.

  • On the Y axis, users will visualize the counts.

Hovering over a bar in the histogram displays information such as Count, Range, and Percentage on tot.


The Clusters tab

In the Clusters tab, users can visualize the groups created by the clustering computation.

This tab contains a spreadsheet showing:

  • the values of the profile attributes for the centroids of created clusters

  • the number of elements

  • the dispersion coefficient

Note

  • The cluster column contains the index of the cluster (centroid).

  • The nelem column contains the number of elements within the cluster.

  • The disp column contains the dispersion coefficient.


The Results tab

In the Results tab, users can visualize a summary of the results.

This tab is divided into two panes:

General Info

Within this pane, users can find the following information:

  • Task label

  • Elapsed time (sec)

  • Number of single samples

  • Average weight of single samples

  • Number of clusters

  • Average dispersion of clusters

  • Dispersion of default cluster

  • Minimum number of points in a cluster

  • Maximum number of points in a cluster

  • Number of singleton clusters

  • Davies-Bouldin index

  • Inter-cluster distance variance

  • Intra-cluster distance variance
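
For readers unfamiliar with the Davies-Bouldin index listed above, the sketch below computes it in its usual textbook form: for each cluster, take the worst ratio between the summed scatter of two clusters and the distance between their centroids, then average over clusters. Lower values indicate more compact, better separated clusters. Whether the software uses exactly this variant (Euclidean distance, mean centroids) is an assumption, and the function names are hypothetical.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def davies_bouldin(clusters):
    """clusters: list of non-empty lists of points; at least two clusters
    with distinct centroids are required."""
    centroids = [[sum(col) / len(col) for col in zip(*c)] for c in clusters]
    # Scatter of a cluster: average distance of its points from the centroid.
    scatter = [
        sum(euclidean(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centroids)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k

# Two hypothetical, well separated clusters.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]
print(davies_bouldin(clusters))  # close to 0.2
```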

Result Quantities

Within this pane, users can set and configure the following options:

  • Average dispersion

  • Average weight

  • Maximum dispersion

  • Minimum dispersion

  • Number of distinct samples

  • Number of samples

All these checkboxes are checked by default.

To the right of these checkboxes, a drop-down list allows users to choose between the following options:

  • Train

  • Test

  • Valid

  • Whole


Example

  • After importing the selected dataset through an Import from Text file task, drag a Data Manager task onto the stage and connect it to the import task. Configure the task as explained above, then save and compute it.

https://cdn.rulex.ai/docs/Factory/clustering_label_ex1.webp

  • Drag a Split Data task onto the stage and connect it to the Data Manager to randomly split the dataset into two subsets (30% test set and 70% training set). Save and compute the task. Then, drag a Label Clustering (k-means) task onto the stage, link it to the Split Data task, and configure it as explained in the sections above.

https://cdn.rulex.ai/docs/Factory/clustering_label_ex2.webp

  • The properties and characteristics of the newly generated clusters can be visualized in the Monitor tab and in the Clusters tab.

https://cdn.rulex.ai/docs/Factory/clustering_label_ex3.webp https://cdn.rulex.ai/docs/Factory/clustering_label_ex4.webp
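
For readers who want to reproduce the random 70/30 split used in this example outside the software, the idea can be sketched in a few lines of Python. This is only an approximation of what the Split Data task does: the function name, the seed, and the rounding rule are assumptions.

```python
import random

def split_rows(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and return (training set, test set)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset of ten row identifiers.
train, test = split_rows(range(10))
print(len(train), len(test))  # 7 3
```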