Data Structures

In Rulex Factory, each task receives a list of data structures from its input tasks (if any), and then forwards a modified list of structures to its output tasks.

Wires do not take into consideration which structures are forwarded: they simply forward all the data structures to any connected child task.

Danger

Any data structure can be present only once in any flow branch. If a task creates a new data structure, it will always overwrite the structure of the same type received from previous tasks.
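The overwrite behavior can be pictured with a minimal Python sketch. It is not Rulex code: the `run_task` helper and the string keys used for structure types are hypothetical, and the sketch deliberately ignores the results structure, which can be appended rather than replaced.

```python
from typing import Dict

# Hypothetical model: a flow branch carries at most one structure per type,
# so a dict keyed by structure type is enough to represent it.
Structures = Dict[str, object]


def run_task(received: Structures, produced: Structures) -> Structures:
    """Return the structures a task forwards to its child tasks.

    Structures the task does not produce pass through unchanged;
    a produced structure replaces any received structure of the same type.
    """
    forwarded = dict(received)   # copy, so the parent's output is untouched
    forwarded.update(produced)   # same-type structures are overwritten
    return forwarded


upstream = {"dataset": "customers table", "rules": "ruleset A"}
downstream = run_task(upstream, {"rules": "ruleset B"})
print(downstream)
```

In this sketch the dataset passes through untouched, while "ruleset A" is no longer available to any task further down the branch.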

The next sections present the details of each data structure.

Dataset

The dataset is the most important data structure in Rulex Factory: it is the format used to represent structured data across the whole Platform.

The dataset describes data by organizing it into a set of columns. All information is stored at column level. A column is called an attribute (or sometimes a feature): it holds all the characteristics of the data, such as type, role, style and so on.

Attributes generated by some modeling tasks and then added to the underlying dataset are often called Results. The difference is not functional: it is visible only in Data Manager tasks, and for all other operational purposes Results are ordinary attributes.

Moreover, operations are always performed on the entire column. This is the main difference between the Factory dataset format and Microsoft Excel Sheet format, where any operation or configuration is performed at cell level.

Most of the other data structures share the same format, even if the data they represent may have a different origin.

Moreover, users are able to convert any other data structure to dataset format by using some dedicated tasks or procedures.

The main features of a dataset are:

Number of attributes
Number of rows
Attribute definitions (type, role, values, style…)
Column Data

The Manager task dedicated to visualizing and editing datasets is the Data Manager.
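To make the column-oriented layout concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `Attribute` and `Dataset` classes, their field names, and the type and role labels are hypothetical and do not reflect the actual Rulex Factory implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Attribute:
    """One column: its definition and its data live together."""
    name: str
    type: str                      # hypothetical label, e.g. "nominal", "integer"
    role: str = "input"            # hypothetical label
    values: List[Any] = field(default_factory=list)   # the column data


@dataclass
class Dataset:
    attributes: Dict[str, Attribute] = field(default_factory=dict)

    @property
    def n_attributes(self) -> int:
        return len(self.attributes)

    @property
    def n_rows(self) -> int:
        first = next(iter(self.attributes.values()), None)
        return len(first.values) if first is not None else 0

    def add(self, attribute: Attribute) -> None:
        # Every column must have the same number of rows.
        if self.attributes and len(attribute.values) != self.n_rows:
            raise ValueError("column length does not match the dataset")
        self.attributes[attribute.name] = attribute

    def apply(self, name: str, fn: Callable[[Any], Any]) -> None:
        """Apply an operation to an entire column, never to a single cell."""
        column = self.attributes[name]
        column.values = [fn(v) for v in column.values]


ds = Dataset()
ds.add(Attribute("age", "integer", values=[31, 45, 27]))
ds.add(Attribute("city", "nominal", values=["Genoa", "Milan", "Rome"]))
ds.apply("age", lambda v: v + 1)   # the whole "age" column is updated
print(ds.n_attributes, ds.n_rows, ds.attributes["age"].values)
```

The sketch has no notion of a cell-level operation: `apply` always rewrites a whole column, which mirrors the contrast with the spreadsheet model drawn above.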

Rules

The rules or ruleset data structure is the data structure representing If-then rules in Rulex Factory.

It is built as a set of rules, each of which is made of a set of conditions. All the rules are connected by the logical OR operator, while all the conditions within a rule are connected by the logical AND operator.

A rule is a combination of conditions describing one of the possible output situations for the test variable.

A condition is a comparison between an attribute and some threshold values. Operators used in rules’ conditions are:

in, not in for nominal or binary attributes.
<= and > for ordered attributes.

Note

It can be proved that this format can represent any possible AND/OR combination of single conditions, through straightforward modifications. It can also be proved that any comparison operator can be reduced to one of the previously mentioned operators by suitable modifications of the threshold values.
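The OR-of-ANDs layout described above can be sketched in a few lines of Python. This is an illustrative model only: the `Condition` and `Rule` classes, the `matching_outputs` helper and the sample attributes are hypothetical, not the Rulex Factory internals. The first rule also shows how a range on an ordered attribute can be written with only the `>` and `<=` operators.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Condition:
    attribute: str
    operator: str     # one of "in", "not in", "<=", ">"
    threshold: Any    # a set of values for in / not in, a number for <= / >

    def holds(self, row: Dict[str, Any]) -> bool:
        value = row[self.attribute]
        if self.operator == "in":
            return value in self.threshold
        if self.operator == "not in":
            return value not in self.threshold
        if self.operator == "<=":
            return value <= self.threshold
        if self.operator == ">":
            return value > self.threshold
        raise ValueError(f"unsupported operator: {self.operator}")


@dataclass
class Rule:
    conditions: List[Condition]
    output: Any

    def fires(self, row: Dict[str, Any]) -> bool:
        # Conditions inside a rule are joined by AND.
        return all(c.holds(row) for c in self.conditions)


def matching_outputs(ruleset: List[Rule], row: Dict[str, Any]) -> List[Any]:
    # Rules are joined by OR: every rule that fires contributes its output.
    return [rule.output for rule in ruleset if rule.fires(row)]


ruleset = [
    # 18 < age <= 65 expressed with the two ordered operators only.
    Rule([Condition("age", ">", 18),
          Condition("age", "<=", 65),
          Condition("city", "in", {"Genoa", "Milan"})], "accept"),
    Rule([Condition("city", "not in", {"Genoa", "Milan"})], "review"),
]

print(matching_outputs(ruleset, {"age": 40, "city": "Milan"}))
print(matching_outputs(ruleset, {"age": 70, "city": "Rome"}))
```

A row for which no rule fires simply yields an empty list in this sketch. How such rows are handled in practice depends on the task consuming the ruleset.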

The rule structure can be converted into a dataset by using the dedicated task.

The rules structure contains not only the representation of the rules themselves, but also some side quantities that help describe them fully. For example, it contains the rule relevances, which communicate the importance of each rule on the underlying data through two values: the Covering and the Error.

In some tasks, like the Export tasks, relevances can be exported or converted separately from the Rules they belong to.

The main features of a ruleset are:

Number of rules
Number of conditions for each rule
Condition representation for each rule
Covering
Error

The Manager task dedicated to visualizing and editing rulesets is the Rule Manager.

Models

The models or modelset data structure is the data structure representing the machine learning black box models in Rulex Factory.

The structure of the modelset varies considerably, depending on the machine learning algorithm that creates it.

The main features of a modelset are:

Model type
Models array
Model weights or coefficients
Model values or labels

Models can also be converted to dataset format by using a dedicated task.

Results

The results data structure contains all the execution information on the machine learning algorithm which has been applied during the flow computation.

This is the only data structure which can be expanded instead of replaced when consecutive machine learning tasks are connected.

For this reason, any task producing a results structure offers the Append results option. If this option is selected, the new rows are added to the original results data structure.

Moreover, results are organized according to the name of the task which produced them: each row corresponds to one statistical or computational result.

Some results may differ according to the modeling sets of the dataset.

The number and the type of results vary, according to the task which produced them. More information on the possible results is provided in each task’s page.

Generally speaking, the features of a results structure are:

Task identifier and task name
Result name
Result modeling set
Result value
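The row layout listed above, together with the Append results option, can be sketched as follows. This is a hypothetical Python model, not Rulex code: the `ResultRow` fields, the `forward_results` and `group_by_task` helpers, the task names and the numeric values are all made up for illustration, and the sketch assumes that results are replaced when the option is not selected.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ResultRow:
    task_id: int
    task_name: str
    result_name: str
    modeling_set: str   # hypothetical labels, e.g. "training" or "test"
    value: float


def forward_results(received: List[ResultRow],
                    produced: List[ResultRow],
                    append_results: bool) -> List[ResultRow]:
    """Return the results structure forwarded by a task."""
    if append_results:
        # New rows are added after the rows received from previous tasks.
        return list(received) + list(produced)
    # Assumption: without the option, the received results are replaced.
    return list(produced)


def group_by_task(rows: List[ResultRow]) -> Dict[str, List[ResultRow]]:
    """Organize rows by the name of the task that produced them."""
    grouped: Dict[str, List[ResultRow]] = {}
    for row in rows:
        grouped.setdefault(row.task_name, []).append(row)
    return grouped


received = [ResultRow(1, "Classifier A", "accuracy", "training", 0.9)]
produced = [
    ResultRow(2, "Classifier B", "accuracy", "training", 0.8),
    ResultRow(2, "Classifier B", "accuracy", "test", 0.7),
]

appended = forward_results(received, produced, append_results=True)
replaced = forward_results(received, produced, append_results=False)
print(len(appended), len(replaced))
print(sorted(group_by_task(appended)))
```

With the option selected, the rows of both tasks remain available downstream and can still be told apart by task name and modeling set.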

The Confusion matrix of the classification and regression tasks is stored in the results.

That’s why the best way to visualize results is using the Confusion Matrix task.

Other dataset-like data structures

The following data structures:

Clusters
Cluster labels
Frequent itemsets
Frequent sequences
Association rules
Auto Regressive Models
Discretization cutoffs
PCA eigenvectors

share the same format as the dataset structure, which is why they are often called dataset-like structures.

The only difference between the dataset structure and the ones mentioned above is how they propagate through the wires linking one task to another.

As most of these dataset-like structures are created by only one task, the Convert Structure to Dataset task allows users to convert them to datasets.

Some of these structures have a dedicated manager task: the Itemsets/Sequences Manager task is used to visualize and edit the frequent itemsets and frequent sequences structures, while the Association Manager task is used to visualize and edit the association rules structure.

Note

Even if they are called association rules, they are stored in dataset format rather than rules format.