Clustering tasks

Clustering tasks identify natural groupings in the data, based on similarities found in the selected input attributes. These groups are referred to as clusters.

Rulex Factory provides Clustering tasks that apply clustering algorithms to solve unsupervised learning problems. These algorithms group similar data points into clusters according to the similarities found among them. All the clustering algorithms are based on k-means, where k is the number of clusters users want to create. Users must specify which attributes to use for evaluating similarity. Each cluster has a central point, known as the centroid, which represents the average of all the data points within the cluster.

Note

An algorithm is a precise set of rules, typically used to solve a class of specific problems, that produces an output or a decision.

Rulex Factory clusters data using k-means algorithms. This method splits the dataset into k clusters, where k is the number of clusters chosen by the user. Each cluster is assigned a centroid, which is the point at the center of that cluster. The clusters do not overlap: each data point belongs to exactly one cluster.
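
The k-means procedure described above can be illustrated with a minimal sketch in plain Python. It is not the Rulex Factory implementation: the function names, the `initial_centroids` parameter and the sample points are assumptions introduced only for illustration. The sketch alternates two steps, assigning each point to its nearest centroid and then moving each centroid to the average of its members, until the assignment stops changing.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0, initial_centroids=None):
    """Return (centroids, assignment) for a basic k-means run.

    initial_centroids is a hypothetical convenience parameter; when it is
    omitted, k distinct points are drawn at random as starting centroids.
    """
    rng = random.Random(seed)
    if initial_centroids is not None:
        centroids = list(initial_centroids)
    else:
        centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point gets exactly one cluster label.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the run has converged
        assignment = new_assignment
        # Update step: move each centroid to the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, assignment


# Illustrative data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, assignment = k_means(points, k=2)
```

Because every point receives a single label in the assignment step, the resulting clusters cannot overlap. A randomly initialized run may settle on a poor grouping, which is why repeating the run and keeping the best result is common practice.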

It is recommended to split the data into the following sets before using any Clustering task:

  • the training set, used to identify patterns in the data and build the model. It usually contains 70-80% of the available data.

  • the test set, used to assess the accuracy of the model. It usually contains 20-30% of the available data.

  • the optional validation set, which can be used for tuning the model parameters.
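
As an illustration of the split described above, the following sketch shuffles a dataset and cuts it into training, validation and test sets. It does not reproduce any Rulex task: `split_dataset`, its parameters and the sample rows are hypothetical names chosen for this example, and the 70/10/20 proportions are just one choice within the ranges given above.

```python
import random


def split_dataset(rows, train_fraction=0.7, validation_fraction=0.0, seed=0):
    """Shuffle rows and split them into (train, validation, test) lists.

    Hypothetical helper for illustration only. The test set receives
    whatever is left after the training and validation sets are cut.
    """
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    if validation_fraction < 0 or train_fraction + validation_fraction >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(rows)                  # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


# Illustrative data: 100 row identifiers split 70 / 10 / 20.
rows = list(range(100))
train, validation, test = split_dataset(rows, train_fraction=0.7, validation_fraction=0.1)
```

Shuffling before cutting avoids a split that merely reflects the original row order, and fixing the seed makes the split reproducible between runs.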

In Rulex, users can divide their datasets by using two different tasks:

If you want to apply the model to the data, remember to add an Apply Model task after the chosen clustering task.


Unsupervised learning

Cluster analysis is an unsupervised learning approach: no examples of a pre-defined output are provided, and the algorithm works exclusively on the input data.


Clustering tasks layout

Rulex Factory’s clustering tasks present common features and a common layout:

  • The Options tab, where users can choose the attributes to work with.

    It is divided into two tabs:

Basic tab

In the Basic tab, according to the chosen task, users will find the following panes with their own specific options and features:

  • Attributes to consider for clustering

  • Clustering options

  • Labels attributes

  • Clustering attributes options

Advanced tab

In the Advanced tab, according to the chosen task, users will find the following options:

  • Number of executions. It is the number of subsequent executions of the clustering process (to be used in conjunction with Random as “Assigntype” option). Among the executions, the one with the best result is kept.

  • Initialize random generator with seed. If selected, a seed that sets the starting point in the sequence will be used during the randomization process.

  • Maximum number of iterations. It indicates the maximum number of iterations of the k-means inside each execution of the clustering process.

  • Keep attribute roles after task execution. If selected, input and output roles can be defined, overwriting the roles previously defined in the Data Manager.

  • Minimum decrease in error value. The error value corresponds to the average distance of each point from its corresponding centroid.

  • Aggregate data before processing. If selected, identical patterns will be aggregated and considered as a single pattern during the evaluation process.

  • Minimum number of occurrences. It represents the minimum number of examples within the training set that must be characterized by a given tag in order to successfully pass the filtering phase.

  • Filter patterns before clustering. If selected, data will be filtered. If not, all representative records will be included in the clustering process.

  • Minimum dispersion coefficient. If the profile attribute values show a dispersion coefficient (computed in relation to the desired central value) greater than the value entered here, the record is considered to show irregular behavior that may compromise the results of the clustering procedure, and it is therefore discarded.

  • Append results. If selected, the results of this computation will be appended to the dataset. Otherwise, they will replace the results of previous computations.

  • The Monitor tab, where users can view the properties of the new clusters.

  • The Clusters tab, where users can visualize the groups created.

  • The Results tab, where users can visualize the results.
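
Several of the Advanced tab options (Number of executions, Initialize random generator with seed, Maximum number of iterations, Minimum decrease in error value) can be pictured with the sketch below. It is an illustration under stated assumptions, not the Rulex Factory implementation: the function names and sample data are hypothetical, and the use of the minimum decrease as a stopping threshold is an assumption, since the option description only defines the error value itself.

```python
import math
import random


def average_error(points, centroids, labels):
    """Average distance of each point from its corresponding centroid."""
    total = sum(math.dist(p, centroids[l]) for p, l in zip(points, labels))
    return total / len(points)


def run_once(points, k, rng, max_iterations, min_decrease):
    """One execution of k-means from randomly chosen starting centroids."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    centroids = rng.sample(points, k)
    previous_error = math.inf
    for _ in range(max_iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        error = average_error(points, centroids, labels)
        # Assumption: stop when the error improves by less than min_decrease.
        if previous_error - error < min_decrease:
            break
        previous_error = error
    return error, centroids, labels


def best_of(points, k, executions=10, seed=None, max_iterations=100,
            min_decrease=1e-6):
    """Repeat the clustering and keep the execution with the lowest error."""
    rng = random.Random(seed)  # a fixed seed makes the whole sequence repeatable
    best = None
    for _ in range(executions):
        result = run_once(points, k, rng, max_iterations, min_decrease)
        if best is None or result[0] < best[0]:
            best = result
    return best


# Illustrative data: two separate groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
error, centroids, labels = best_of(points, k=2, executions=10, seed=42)
```

With random starting centroids, a single execution can end in a poor grouping; running several executions and keeping the one with the lowest error reduces that risk, while the seed keeps the outcome reproducible.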