The Data Manager#

The Data Manager is a central task designed to perform several data management operations.

With the Data Manager, you can:

  • see the data behind your data management flow in an enhanced sheet-based view.

  • understand whether all the required data for your model are already included in your data tables, or whether you need to enrich the data tables with additional attributes created through formulas.

  • aggregate multiple rows to condense information into fewer, more significant rows, for example by aggregating all the rows corresponding to a customer into a single row using the Group and Apply operations.

  • check whether data are clear and coherent, for example by checking that the attribute types are correctly defined. Note that an incorrect data type may have been assigned automatically because one or more values were entered in an incorrect format. If you try to change the data type to the correct one, Rulex Platform will indicate which row contains the format error.

  • standardize the way missing values are expressed (for example, missing values can be represented with the letters “n/a”, or a question mark).

  • explore data in the Plots or Sheets tabs, to check visually whether the data at hand are appropriate for solving your problem, and to detect and remove any abnormal data (i.e. outliers) that may alter the generated models. Outliers are often bad data points, but they can also contain valuable information about the process under investigation or about the data gathering and recording process. Before removing any of these points from the data, try to understand why they appeared and whether similar values are likely to continue to appear.
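To illustrate two of the operations listed above (standardizing missing values and condensing all the rows of a customer into one), here is a minimal sketch in plain Python. It does not use Rulex Platform's interface or formula language; the column names, the set of missing-value markers, and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical markers that different sources may use for a missing value.
MISSING_MARKERS = {"n/a", "?", ""}

def standardize_missing(rows):
    """Return a copy of the rows where every missing-value marker is None."""
    cleaned = []
    for row in rows:
        cleaned.append({
            key: None
            if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS
            else value
            for key, value in row.items()
        })
    return cleaned

def total_per_customer(rows):
    """Condense all rows of a customer into one row with the summed amount."""
    totals = defaultdict(float)
    for row in rows:
        amount = row["amount"]
        # Missing amounts contribute nothing to the total.
        totals[row["customer"]] += 0.0 if amount is None else amount
    return [{"customer": c, "amount": t} for c, t in sorted(totals.items())]

rows = [
    {"customer": "A", "amount": 10.0},
    {"customer": "A", "amount": "n/a"},
    {"customer": "B", "amount": 5.0},
    {"customer": "A", "amount": 2.5},
    {"customer": "B", "amount": "?"},
]

cleaned = standardize_missing(rows)
summary = total_per_customer(cleaned)
```

In this sketch the five input rows are reduced to one row per customer, and the two differently written missing values are treated in the same way.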

Thanks to the history management of Rulex Factory manager tasks, any operation performed in the Data Manager is saved in code form and can be executed on new data at any time. This allows you to work out the necessary operations on training data and then apply them to real or production data without any extra effort. Delivery of developed solutions is then straightforward with the tools offered by the self-code framework of Rulex Factory.
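The idea of recording operations once and replaying them on new data can be sketched as follows. This is a conceptual illustration only, not the format in which Rulex Factory stores its history; the `History` class, the recorded operations, and the `age` column are all hypothetical.

```python
class History:
    """Ordered list of named operations that can be replayed on any dataset."""

    def __init__(self):
        self.steps = []

    def record(self, name, operation):
        # Store the operation itself, not its result, so it can be reused.
        self.steps.append((name, operation))

    def replay(self, rows):
        # Apply every recorded step, in order, to the given dataset.
        for _name, operation in self.steps:
            rows = operation(rows)
        return rows


def drop_rows_without_age(rows):
    return [r for r in rows if r["age"] is not None]

def add_is_adult(rows):
    return [{**r, "is_adult": r["age"] >= 18} for r in rows]


history = History()
history.record("drop rows without age", drop_rows_without_age)
history.record("add is_adult", add_is_adult)

training = [{"age": 34}, {"age": None}, {"age": 15}]
production = [{"age": 70}, {"age": None}]

# The same recorded steps are applied first to training, then to production data.
prepared_training = history.replay(training)
prepared_production = history.replay(production)
```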

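Data Manager is divided into two main areas:

  • The Attribute List, located at the left of the screen.

  • The main Data pane, organized into different tabs.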

An additional pane, the Modeling Sets bar, located on the right, next to the main Data pane’s tabs, allows users to select and define the subsets of their dataset that are useful for modeling (training, test or validation sets).


The Attribute List#

Located at the left of the screen, the attributes’ pane displays a list of all the available attributes in the current dataset, and allows you to organize them by changing their position, type, or role. It also allows you to search and sort them, to add new attributes to the current dataset, or to delete existing ones.

For more information on the user interactions available in the attribute list, see page: The Attribute List. This section gives only a brief introduction to the structure of the pane. The attribute list is divided into two sublists:

  • Attributes sublist, which contains all the columns imported from external resources or created from them using pre-processing transformations.

  • Results sublist, which contains all the columns automatically added by a particular machine learning Rulex Factory task.

Hint

Even though these two sublists are graphically separated, all operations can be applied in the same way to either of them. The division is primarily a logical one, to help users immediately understand what information is provided and what results are generated.

These sublists can be ordered by using the drop-down menu located at the bottom of the attributes’ pane. The possible sort criteria are:

  • Attribute (the default: the original column position of the attribute in the dataset)

  • Name

  • Type

  • Ignored

  • Role

Each of these orderings preserves the distinction between attributes and results: the ordering criterion is applied separately to each sublist. For further information about the ignored and role properties of attributes, see page: The Attribute List.
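The effect of applying a sort criterion separately to each sublist can be sketched in a few lines of Python. The attribute names, the type labels, and the dictionary layout are hypothetical and chosen only for illustration; they do not reflect how Rulex Platform stores the list internally.

```python
from operator import itemgetter

# Hypothetical attribute list: each entry records the sublist it belongs to
# and its original column position in the dataset.
attributes = [
    {"name": "income", "type": "continuous", "sublist": "attributes", "position": 0},
    {"name": "age", "type": "integer", "sublist": "attributes", "position": 1},
    {"name": "pred(churn)", "type": "nominal", "sublist": "results", "position": 3},
    {"name": "churn", "type": "nominal", "sublist": "attributes", "position": 2},
    {"name": "conf(churn)", "type": "continuous", "sublist": "results", "position": 4},
]

def sort_attribute_list(entries, criterion):
    """Sort each sublist on its own, always listing attributes before results."""
    ordered = []
    for sublist in ("attributes", "results"):
        members = [e for e in entries if e["sublist"] == sublist]
        ordered.extend(sorted(members, key=itemgetter(criterion)))
    return ordered

by_name = [e["name"] for e in sort_attribute_list(attributes, "name")]
by_position = [e["name"] for e in sort_attribute_list(attributes, "position")]
```

Whatever criterion is chosen, the entries of the results sublist never mix with those of the attributes sublist.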

Next to the top label Attribute list, a magnifying glass icon enables the user to search and filter the whole list according to the presence of the entered string. This filter operation only applies to the list itself and has no effect on the number or the position of the columns displayed in the main Data pane. The selection of the columns to be shown in the main tab is controlled by the checkboxes located to the left of each attribute name rather than by the search feature.
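The independence between the search filter and the checkboxes can be pictured with the following small Python sketch. The attribute names and both functions are hypothetical, and the case-insensitive matching is an assumption rather than documented behavior.

```python
# Hypothetical state: the names in the attribute list and those whose checkbox is ticked.
attribute_names = ["customer_id", "customer_age", "income", "churn"]
checked = {"customer_id", "income", "churn"}

def search(names, text):
    """Substring filter applied to the attribute list only (case-insensitive by assumption)."""
    return [n for n in names if text.lower() in n.lower()]

def displayed_columns(names, checked_names):
    """Columns shown in the Data pane depend on the checkboxes, not on the search."""
    return [n for n in names if n in checked_names]

listed = search(attribute_names, "customer")
shown = displayed_columns(attribute_names, checked)
```

Here the search narrows the list to two entries, while the columns shown in the Data pane remain the three checked ones.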

See also

The description of drag-and-drop operations and the dedicated right-click context menu is provided on the following page: The Attribute List


The main Data pane#

The main Data pane groups, in different tabs, all the operations that the Data Manager can apply to your data. On the first main tab, functions, queries and direct transformations are performed on rows and columns, while the subsequent tabs cover various data inspection aspects, such as plot representation and statistics.

https://cdn.rulex.ai/docs/Factory/main-data-pane.webp

The tabs listed in this pane are the following:

  • The Data tab: displays the data as a spreadsheet. Clicking the double-arrow button switches it to the Attributes tab, where you can bulk edit the characteristics of the attributes. For more information on the Attributes tab, see page: The Attributes tab.

  • The History tab: displays all the operations performed during the current session, and allows you to move or delete some of them.

  • The Plots tab: displays the plots on data as defined in the Plot Manager.

  • The Sheets tab: displays the statistics on data as defined in the Sheets tab.

See also

History management is a common feature of all Manager tasks. Therefore, the behavior of the History tab is explained in detail in the manager task overview.


The Modeling Sets bar#

When applying a machine learning model, it is customary to define three different segments of rows, called:

  • Training Set

  • Test Set

  • Validation Set

In Rulex Platform, these are referred to as Modeling sets.

The Modeling Sets bar allows you to filter data according to the modeling set each row belongs to. You can also define new modeling sets based on the filter applied in the Query Manager.

Tip

The Modeling Sets bar is located in the Data Manager header bar tab. To collapse it, click the arrow icon located at the far left of the Modeling Sets bar.

The Modeling Sets bar contains four different icon buttons:

  • All, which shows all the rows of the dataset

  • Training, which shows only the Training set

  • Test, which shows only the Test set

  • Validation, which shows only the Validation set

Right-clicking any of these buttons, except All, gives you access to the following command:

  • Assign displayed rows to <button name> set: this entry assigns the displayed rows to the specified modeling set.
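The behavior of the four buttons and of the assign command can be sketched in plain Python as follows. This is a conceptual illustration under stated assumptions, not Rulex Platform code: the `set` field, the `year` column, the initial assignment of every row to the Training set, and both functions are hypothetical.

```python
# Hypothetical dataset: each row carries the label of the modeling set it belongs to.
rows = [
    {"id": 1, "year": 2021, "set": "Training"},
    {"id": 2, "year": 2022, "set": "Training"},
    {"id": 3, "year": 2023, "set": "Training"},
    {"id": 4, "year": 2023, "set": "Training"},
]

def show(rows, button):
    """Return the rows a button would display: every row for 'All', else one set."""
    if button == "All":
        return list(rows)
    return [r for r in rows if r["set"] == button]

def assign_displayed_rows(displayed, target_set):
    """Assign the currently displayed rows to the given modeling set."""
    for row in displayed:
        row["set"] = target_set

# A query filter displays only the most recent rows, which are then
# assigned to the Test set.
displayed = [r for r in rows if r["year"] == 2023]
assign_displayed_rows(displayed, "Test")

training_ids = [r["id"] for r in show(rows, "Training")]
test_ids = [r["id"] for r in show(rows, "Test")]
all_ids = [r["id"] for r in show(rows, "All")]
```

After the assignment, the Training button would show the two older rows and the Test button the two rows from 2023, while All still shows the whole dataset.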