Clustering tasks

In Rulex Factory, clustering tasks use the k-means clustering algorithm, which divides the samples in a dataset into a number k of groups (clusters) according to a given measure of similarity: two examples belonging to the same group must be more similar to each other than two examples associated with different clusters.

This type of problem is called an unsupervised learning problem, and its output is a collection of clusters characterized by an index, a central vector (centroid), and a dispersion value, measuring the normalized average distance of cluster members from the centroid.
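
As an illustration of the output described above, the following is a minimal sketch of k-means in plain Python. It is not Rulex's implementation: the function names `kmeans` and `summarize` are hypothetical, the initial centroids are drawn at random, and the dispersion is computed as the plain (non-normalized) average Euclidean distance from the centroid, since the exact normalization is not specified here.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, max_iter=100, seed=0):
    """Hypothetical helper: plain k-means with randomly chosen initial centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

def summarize(points, centroids, assignment):
    """Describe each cluster by index, centroid, size, and dispersion."""
    summary = []
    for c, centroid in enumerate(centroids):
        members = [p for p, a in zip(points, assignment) if a == c]
        # Assumption: dispersion = average distance of members from the centroid.
        disp = sum(euclidean(p, centroid) for p in members) / len(members) if members else 0.0
        summary.append({"cluster": c + 1, "centroid": centroid, "nelem": len(members), "disp": disp})
    return summary

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centroids, assignment = kmeans(points, k=2)
clusters = summarize(points, centroids, assignment)
```

Each entry of `clusters` mirrors the three quantities named above: an index, a centroid, and a dispersion value.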

Before using any Clustering task, it can be useful to split the data into the following subsets:

- the training set, used to identify patterns in the data and build the model. It usually contains 70-80% of the available data.
- the test set, used to assess the accuracy of the model. It usually contains 20-30% of the available data.
- the optional validation set, which can be used to tune the model parameters.
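
Outside Rulex, the same kind of split can be sketched in a few lines of Python. The helper `split_rows` and its parameters are hypothetical and only illustrate a random 70/30 partition; they do not describe how the Rulex tasks work internally.

```python
import random

def split_rows(rows, train_fraction=0.7, seed=42):
    """Hypothetical helper: shuffle the rows and cut them into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))          # stand-in for 100 dataset rows
train, test = split_rows(rows)   # 70 rows for training, 30 for testing
```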

In Rulex, users can divide their datasets by using two different tasks:

- the Data Manager task
- the Split Data task

To apply the model to data, remember to add an Apply Model task after the clustering task.
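
Conceptually, applying a clustering model amounts to assigning each record to a cluster. The sketch below assumes a nearest-centroid rule with Euclidean distance; the function `apply_model` and the sample values are hypothetical and are not taken from the Apply Model task itself.

```python
import math

def apply_model(centroids, rows):
    """Hypothetical helper: return the 1-based index of the nearest centroid for each row."""
    labels = []
    for row in rows:
        distances = [math.dist(row, centroid) for centroid in centroids]
        labels.append(distances.index(min(distances)) + 1)
    return labels

centroids = [(1.0, 1.0), (9.0, 9.0)]               # assumed output of a clustering run
new_rows = [(0.5, 1.5), (8.0, 10.0), (5.0, 4.0)]   # records to be labeled
labels = apply_model(centroids, new_rows)
```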

Clustering tasks layout

Rulex Factory’s clustering tasks share a common layout. They are divided into four tabs:

- the Options tab
- the Monitor tab
- the Clusters tab
- the Results tab

The Options tab

In the Options tab, users can configure the analysis. It is divided into two tabs: the Basic tab and the Advanced tab.

In the Basic tab, users will find the following panes:
- The Available attributes list, where the dataset’s attributes are listed. It is always available and visible both in the Basic and in the Advanced tabs.
- The attributes drop area, where users can drag the chosen attributes onto the following panes:
  - Attributes to consider for clustering: drag the list of the attributes that will be used as profile attributes in the clustering computation. This operation can also be done via a filtered list.
  - Label attributes: drag the attributes that will be considered as labels in the clustering computation.
- Two customization panes, located under the attributes drop area one next to the other, containing the following options:
  - Clustering type: select from the drop-down list the approach to compute cluster centroids. The following approaches are available: k-means, k-medians, k-medoids.
  - Clustering algorithm: select from the drop-down list the clustering algorithm. The following algorithms are available: Standard, Incremental, Error-based.
  - Distance method for clustering: select from the drop-down list the method employed for computing distances between examples. The possible methods are: Euclidean, Euclidean (normalized), Manhattan, Manhattan (normalized), Pearson.
  - Distance method for evaluation: select from the drop-down list the method employed for computing distances during evaluation. The available methods are: Euclidean, Euclidean (normalized), Manhattan, Manhattan (normalized), Pearson.
  - Normalization for ordered variables: select from the drop-down list the type of normalization to use when treating ordered variables. The possible values are: None, Attribute, Normal, Minmax [0,1], Minmax [-1,1].
  - Initial assignment for clusters: select from the drop-down list the procedure adopted for the initial assignment of points to clusters. Possible options are: Random, Smart, Weight based.
  - Number of clusters to be generated: the required number of clusters. The number of clusters cannot exceed the number of different examples in the training set.
  - Attribute for initial cluster assignment: optionally select a specific attribute from the list, which will be used as an initial cluster assignment.
  - Attribute for weights: optionally select an attribute from the list, which will be used as a weight in the clustering process.
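
To make the distance and normalization options above more concrete, here is a small sketch of the standard formulas behind the names. How Rulex implements them is not documented here, so treat these as assumptions: in particular, the Pearson distance is taken as one minus the Pearson correlation, and `minmax` is a hypothetical helper covering the [0,1] and [-1,1] rescalings.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    """Assumption: 1 - Pearson correlation. Undefined if either vector is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (std_a * std_b)

def minmax(values, low=0.0, high=1.0):
    """Rescale values linearly so that the minimum maps to low and the maximum to high."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [low for _ in values]  # constant column: nothing to rescale
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]

d_euclid = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
d_pearson = pearson_distance([1, 2, 3], [2, 4, 6])
unit_scaled = minmax([2, 4, 6])            # rescaled to [0, 1]
signed_scaled = minmax([2, 4, 6], -1, 1)   # rescaled to [-1, 1]
```

Normalizing ordered variables before computing distances keeps attributes with large numeric ranges from dominating the result.
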

In the Advanced tab, users will find advanced configuration options, which can be used to better customize the analysis. The following options are available:
- Number of executions: the number of subsequent executions of the clustering process (to be used together with Random as the Initial assignment for clusters option). Among the executions, the one with the best result is kept.
- Initialize random generator with seed: if selected, the given seed sets the starting point of the random sequence used during the randomization process.
- Maximum number of iterations: the maximum number of iterations of the k-means inside each execution of the clustering process.
- Keep attribute roles after task execution: if selected, input and output roles can be defined, overwriting the roles previously defined in the Data Manager.
- Minimum decrease in error value: the minimum decrease required in the error value, where the error is the average distance of each point from its corresponding centroid.
- Aggregate data before processing: if selected, identical patterns will be aggregated and considered as a single pattern during the evaluation process.
- Minimum number of occurrences: the minimum number of examples within the training set that must be characterized by a given tag in order to successfully pass the filtering phase.
- Filter patterns before clustering: if selected, data will be filtered. If not, all representative records will be included in the clustering process.
- Minimum dispersion coefficient: if the profile attribute values of a record show a dispersion coefficient (computed in relation to the desired central value) greater than the value entered here, the record is considered to show irregular behavior that may compromise the results of the clustering procedure and is therefore discarded.
- Append results: if selected, the results of this computation will be appended to the dataset. Otherwise, they will replace the results of previous computations.
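
The interplay between several of these options (number of executions, seed, maximum number of iterations, minimum decrease in error) can be sketched as follows. This is a simplified one-dimensional illustration, not Rulex's code: `run_once` and `best_of` are hypothetical names, and the stopping rule (stop when the error improves by less than `min_decrease`) is an assumption about how such a threshold is typically used.

```python
import random

def run_once(values, k, rng, max_iter=100, min_decrease=1e-6):
    """One execution of 1-D k-means from a random start; returns (error, centroids)."""
    centroids = rng.sample(values, k)
    previous_error = float("inf")
    error = previous_error
    for _ in range(max_iter):
        # Assign each value to its nearest centroid.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # Recompute centroids; an empty group keeps its previous centroid.
        centroids = [sum(g) / len(g) if g else centroids[i] for i, g in enumerate(groups)]
        # Error: average distance of each value from its nearest centroid.
        error = sum(min(abs(v - c) for c in centroids) for v in values) / len(values)
        if previous_error - error < min_decrease:
            break  # assumed stopping rule: improvement too small
        previous_error = error
    return error, sorted(centroids)

def best_of(values, k, executions=10, seed=0):
    """Repeat the execution with different random starts and keep the lowest error."""
    rng = random.Random(seed)  # a fixed seed makes the whole sequence reproducible
    return min(run_once(values, k, rng) for _ in range(executions))

values = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
error, centroids = best_of(values, k=3)
```

Because a random start can converge to a poor local solution, repeating the process and keeping the best run trades extra computation for a more reliable result.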

The Monitor tab

Within this tab, users can view the properties of the new clusters. The Monitor tab is itself divided into two tabs: the Elements tab and the Dispersion tab.

In the Elements tab, users can view a histogram describing each generated cluster.

- On the X axis, users can view the clusters’ Range.
- On the Y axis, users can view the count, that is, the number of clusters contained in the corresponding range.

By hovering over each bar, information on the following elements can be visualized:

- Count: the number of clusters contained in the bar,
- Range: the clusters’ range,
- Percentage on tot: the percentage the cluster represents over the total number of clusters,
- Percentage on bar: the percentage the cluster represents over the clusters contained in the bar.
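
The binning behind such a histogram can be sketched as below. The helper `histogram`, the bin width, and the sample cluster sizes are all hypothetical; the sketch only shows how counts per range and a percentage on the total could be derived, not how the Monitor tab computes them.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall in each [start, start + bin_width) range."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    total = len(values)
    return [
        {
            "range": (start, start + bin_width),
            "count": n,
            "percentage_on_tot": 100.0 * n / total,
        }
        for start, n in sorted(counts.items())
    ]

cluster_sizes = [12, 15, 31, 34, 38, 57]   # hypothetical number of elements per cluster
bars = histogram(cluster_sizes, bin_width=20)
```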

Within the Dispersion tab, the dispersion coefficient can be visualized.

- On the X axis, users can view the dispersion’s Range.
- On the Y axis, users can view the count, that is, the number of clusters whose dispersion falls in the corresponding range.

By hovering over a bar, users can view its Count, Range, and Percentage on tot.

The Clusters tab

The Clusters tab contains a spreadsheet displaying the values of the profile attributes for the centroids of the generated clusters, along with the number of elements and the dispersion coefficient for each of them.

The following additional columns are displayed:

- the columns cluster(attribute_name) refer to a specific tag included in some patterns of the training set,
- the column cluster contains the index of the cluster,
- the column nelem contains the number of elements which make up the cluster,
- the column disp contains the dispersion coefficient.

The Results tab

In the Results tab, users can visualize computation information and details. This tab is divided into two panes, the General Info and the Result Quantities panes.

The General Info pane contains the following details:

- Task label
- Elapsed time (sec)

The Result Quantities pane contains information on the results. For more details on the displayed quantities, go to the corresponding task’s page.