Import Tasks
A key feature of Rulex Factory is its ability to work with and within data. To do this, users first need to import their data into Rulex Factory.
Importing data into the Factory means reading data from an existing source and transforming them into a table called a dataset.
Data can be imported from many sources:
From file
From database
From other data structures of Rulex Platform itself.
By defining the new dataset from scratch.
Rulex Factory provides users with different import tasks that allow them to perform these operations. First and foremost, users need to define what they want their data to look like and, based on this, decide which task to use. Users can then place the selected task on the stage by dragging and dropping it from the Task panel located on the right.
Import from File
Users can import their data directly from files. Rulex Factory supports many file formats:
Excel
Parquet
PDF
JSON
Text
Word
XML
This list will be extended in the future.
Consequently, the available Import from File Tasks in Rulex Factory are listed below:
While Rulex Factory provides a dedicated task for each file format, support for the different types of source connection (local, or remote such as Sharepoint, S3 buckets and so on) is integrated in every task. As the configuration of the connection to the file is shared by all Import from File tasks, it is described on this page.
As previously explained, all the import from file tasks share some common features and options. Each import task has its own Options tab and Table Preview pane, which are the same for each task. However, the Configuration tab is different for each import task. To better understand how to configure the selected import task, please refer to the dedicated pages for each task.
The Options panel is divided into two more panes:
The File Options
Location controller
In the Location controller, users can choose the files or folders they want to import.
Before choosing the actual file or folder, users need to define the filesystem to which Rulex Factory has to connect.
The filesystem connection can be defined by using two different connection types:
Saved connection
Custom connection
Note
If users decide to configure a Saved connection, they first need to retrieve their previously saved Filesystem Resource from the Explorer. By clicking on the three dots on the right, users can choose a filesystem resource they have already saved in the environment. These resources can be reused in other tasks or flows.
A Custom connection is a type of connection set up in the task itself. It allows users to add a new connection to one of the supported remote filesystems or to a Local Filesystem (e.g. your machine disk) and to use it in the current import task only.
See also
Please refer to the Filesystem page for further information about Filesystem Resources.
Procedure
To import from a Custom source, users need to follow these steps:
Choose the filesystem you will be importing from. The supported filesystems are:
Sharedrive
Http Server
Ftp Server
Aws S3
Sharepoint
HDFS
Azure BLOB Storage
Sftp Server
Google Drive
Outlook
Local File System
Configure the connection to the chosen filesystem by clicking the pencil icon located on the right. A dedicated dialog window, different for each filesystem, will appear to guide you through the definition of the remote connection. More detailed information on how to configure a chosen filesystem is available on the dedicated Filesystem Resource page.
Once the connection to the filesystem has been successfully established, users can choose which files or folders to import. Rulex Factory supports:
Single file import: a single file is imported using the filesystem connection.
Multiple files import: several files from the same filesystem are imported and then concatenated using the concatenation options presented below.
Entire folders: entire folders can be selected too. The system gathers all the files with a matching format from the folder and concatenates them into a single dataset.
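The folder import described above can be pictured with a minimal Python sketch. It is only an illustration, not the Rulex Factory implementation: the function name `gather_matching_files` is hypothetical, and the sketch assumes that only files directly inside the folder are considered (whether Rulex Factory also scans subfolders is not stated here).

```python
from pathlib import Path


def gather_matching_files(folder, extension):
    """Return the files directly inside `folder` whose suffix matches `extension`.

    Hypothetical helper: the match is case-insensitive and non-recursive,
    which is an assumption made for this sketch.
    """
    wanted = extension.lower()
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == wanted
    )
```

Each returned file would then be read with the task for its format and the resulting tables concatenated according to the concatenation options.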
To select files or folders, click the SELECT button. The Path selection panel is activated only after a successful connection. Until a successful remote connection has been established, the SELECT button is greyed out.
The Path selection panel consists of several entry lines where users can either write or browse the relative path of their file or folder. Buttons located on the right side of the Path selection panel manage the number of lines:
Add new path button: it adds a new empty line where the user can define a new path.
Delete all paths button: it erases all previously defined paths. Only a default empty line remains.
The Delete icon located at the end of the line allows users to erase the corresponding line, or to clear out the content of the default first line.
To browse the Filesystem location and search for your path, the user can click the SELECT button located at the end of each line. This button will open a File explorer in a dedicated dialog window.
In the File explorer dialog window, the user can inspect a filesystem tree. Folders and files are distinguished by two different icons on the left of each tree line. The available operations are:
A folder can be expanded to look into its content by left-clicking on its name. The icon will become outlined to indicate its content has been revealed.
A folder can be selected by right-clicking on its name or by clicking directly on the folder icon. The icon will become highlighted to indicate its selection. Moreover, all its parent folders will increase a counter located at the end of their name to show the number of files/folders selected under that location.
A file can be selected by left-clicking on its name or icon. The icon will become highlighted to indicate its selection.
A file can be double-clicked to automatically close the file explorer dialog window and to insert the selected path into the considered line.
To select more than one file or folder, or a range of files or folders, hold down either the Ctrl or the Shift key. Once the selection has been made, the user can click Confirm to confirm it or Cancel to discard all the previous operations. If more than one file or folder has been selected in the file explorer, the number of lines in the Path selection panel adjusts accordingly.
Note
Depending on the version of Rulex Factory being used, the local filesystem works differently from other file sources:
In the standalone version, users can select files from the local filesystem by clicking SELECT and using the OS file selector dialog box. To select folders, users first need to activate the folder selection mode by clicking the Active select folder icon. When the folder selection mode is active, the Active select folder icon turns green and, when clicking SELECT, users see the OS folder selector, which allows folder selection from the local disk.
In the cloud version, files located in the local filesystem are uploaded to the server, so their content is frozen at the time of the manual upload. In the cloud version you cannot select entire folders; you need to upload their contents manually. For this reason the pane is slightly different: it does not include the Active select folder button, and the SELECT button is replaced by the UPLOAD button.
See also
Within this panel, the Import From Excel File task provides one more section in addition to the File section mentioned above: the Sheets tab. This tab allows users to specify which sheets should be considered during the import phase. Further information about this pane is available on the dedicated Import from Excel task page.
File Options
As well as defining the file system connection and selecting files or folders to import, the File Options pane provides further customization options.
Concatenation options: they establish how multiple files are concatenated into a single dataset.
Error policy: it defines how to proceed if an issue occurs during the import phase.
Missing file behavior: it defines how the import from file task should behave when the selected file is missing.
The concatenation of multiple files can be configured by defining the following options:
Concatenation type: Outer or Inner.
Inner concatenation: the concatenated dataset only includes attributes (i.e. columns) that are present in both datasets.
Outer concatenation: the merged dataset includes all attributes from both datasets. When information is missing, an empty cell is displayed.
Match columns by: it defines the way users want to match columns, according to their Name or Position.
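The following Python sketch illustrates how the two concatenation types and the two column-matching modes could combine. It is a simplified model, not the Rulex Factory implementation: the function name `concatenate`, the `(columns, rows)` table representation and the use of `None` for an empty cell are all assumptions made for illustration.

```python
def concatenate(first, second, how="outer", match_by="name"):
    """Concatenate two tables, each given as a (columns, rows) pair.

    how: "inner" keeps only shared columns, "outer" keeps all of them.
    match_by: "name" aligns columns by header, "position" by index.
    Missing cells are filled with None (standing in for an empty cell).
    """
    cols1, rows1 = first
    cols2, rows2 = second

    if match_by == "position":
        shortest = min(len(cols1), len(cols2))
        longest = max(len(cols1), len(cols2))
        width = shortest if how == "inner" else longest
        # Column names come from the first table, then from the wider one.
        wider = cols1 if len(cols1) >= len(cols2) else cols2
        columns = (cols1 + wider[len(cols1):])[:width]
        rows = [(list(row) + [None] * width)[:width] for row in rows1 + rows2]
        return columns, rows

    if how == "inner":
        columns = [name for name in cols1 if name in cols2]
    else:
        columns = cols1 + [name for name in cols2 if name not in cols1]

    def project(cols, row):
        by_name = dict(zip(cols, row))
        return [by_name.get(name) for name in columns]

    rows = [project(cols1, r) for r in rows1] + [project(cols2, r) for r in rows2]
    return columns, rows


people = (["id", "name"], [[1, "Ann"]])
places = (["id", "city"], [[2, "Rome"]])
outer_by_name = concatenate(people, places, how="outer", match_by="name")
inner_by_name = concatenate(people, places, how="inner", match_by="name")
```

In this example the outer result keeps the three columns `id`, `name` and `city` with empty cells where a file lacks a column, while the inner result keeps only `id`.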
If the Use old computation data if the source file is not available checkbox is selected, data from a previous computation are used when the import fails, for example if an error is detected within the whole task. In this scenario, the error is downgraded to a warning.
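The fallback behavior can be sketched in Python as follows. This is only a conceptual illustration under stated assumptions: `import_with_fallback`, `load_source` and `previous_result` are hypothetical names, and how Rulex Factory actually stores the data of a previous computation is not described here.

```python
import warnings


def import_with_fallback(load_source, previous_result):
    """Run `load_source`; if it fails, reuse `previous_result` and warn.

    Hypothetical helper: when no previous result exists, the original
    error is raised unchanged.
    """
    try:
        return load_source()
    except Exception as error:
        if previous_result is None:
            raise  # nothing to fall back on: the error stays an error
        # The error is downgraded to a warning and old data are returned.
        warnings.warn(f"Import failed ({error}); using previous computation data.")
        return previous_result
```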
When unsure whether there is a target file or folder, the following options can be used:
Wait until the target file is present (poll every X seconds): if selected, Rulex polls the target file at the specified frequency (entered in the corresponding box) until it is available.
Continue even if the folder is empty: if selected, the computation of the task continues even if the selected source files are not available. In this case, the import operation returns an empty dataset.
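The polling option can be pictured with a short Python sketch. It is a minimal illustration, not the Rulex Factory implementation: the function name `wait_for_file` is hypothetical, and the optional `timeout_seconds` parameter is an addition made for this sketch so that the loop can be stopped (the option described above polls until the file is available).

```python
import time
from pathlib import Path


def wait_for_file(path, poll_seconds, timeout_seconds=None):
    """Poll `path` every `poll_seconds` until it exists.

    Returns True once the file is present. If `timeout_seconds` is given
    (an assumption of this sketch) and expires first, returns False.
    """
    target = Path(path)
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while not target.exists():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True
```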
Table Preview Panel
In this area, users can view a preview of their imported tables. The number of lines shown in this preview is controlled by the Number of records in preview number field.
Not all previews are automatically generated. In the Import from JSON and Import from XML tasks, the preview is displayed by clicking on the LOAD PREVIEW button.
Import from Database
As mentioned above, Rulex Factory can build a dataset from tables and data stored in a SQL database. Rulex Factory provides two different tasks for this operation:
the Import from Database task.
the Conditional Import task.
The main difference between the two tasks is that the Conditional Import task receives an input dataset, which is used to further specify the SQL query executed during the database import phase. The database connection, however, is configured in the same way in both tasks, and is explained in this section.
Warning
All SQL databases require the installation of their corresponding ODBC drivers to work correctly.
Both import tasks consist of two panels: the Options panel and the Table Preview panel.
The Options panel is divided into two more panes: the Database controller and the Table Options.
Database controller
In the Database controller, users can import tables or perform queries on the previously connected SQL database.
Users need to define a database connection. Database connections can be defined by using two different connection types:
Saved connection
Custom connection
When selecting a saved connection, users need to retrieve the previously saved database resource from the Explorer. By clicking on the three dots on the right of the database controller, users can select one of the database resources previously saved in the environment.
See also
Please refer to the database page for further information about Database Resource.
A Custom connection is a connection to a database that is valid only in the current task and its child tasks.
Procedure
To import from a Custom source, users need to follow these steps:
Choose the database type you will be importing from. The supported databases are:
SQLite
Oracle
MySQL
SQL Server
PostgreSQL
IBM DB2
IBM DB2 AS400
Azure Synapse Analytics
Impala
Spark
Hive
Teradata
OpenText Gupta SQLBase
Microsoft Access
SAP Hana
Generic ODBC Connection
Configure the connection to the chosen database by clicking the pencil icon located on the right. A dedicated dialog window, different for each database type, will appear as a guide in defining the database connection. More detailed information on how to configure a chosen database is available on the dedicated Database Resources page.
Once the database connection has been established, users can choose to import entire tables or views, or specify a SELECT query to be executed. Based on this choice, users then configure one of the two tabs located just below the connection configuration row. Use the Table and Query tab buttons to switch between the two modes: to import an entire table, use the Table tab; to write and execute queries, use the Query tab.
Warning
If both panes are configured, the system prioritizes query execution, and the Table pane options are ignored.
When the Table tab is selected, the list of tables is displayed; users can select the tables to import by clicking on them. To select more than one table, they need to hold down the Ctrl key while clicking.
On the right side of the Table pane, users will find the following options:
Check all: it selects all the available tables.
Uncheck all: it unselects all the previously selected tables.
Invert: it inverts the whole selection, making selected tables unselected and vice versa.
If users select more than one table, the final dataset is formed by concatenating all the imported tables, according to the Concatenation Options set in the Table Options pane.
On the Query pane instead, users can write their SQL SELECT queries. Within this pane, the following options are available:
Add query: to add a new query, users can click on this button located on the right side of the pane.
Once a query has been added, users can type their SQL code in the box provided.
The query code can be deleted at any time by clicking on the Cross icon on the left of the pane. If the query is already empty, clicking on the same button will delete the entire query line.
Delete query: to delete the last query, users can click on this button, located on the right side of the pane.
If users define more than one query, the final dataset is formed by concatenating all the data fetched by the given queries, according to the Concatenation Options set in the Table Options pane.
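As a rough illustration of how several SELECT queries can feed one dataset, the Python sketch below runs two queries against an in-memory SQLite database and stacks the fetched rows. It is not the Rulex Factory implementation: the function name `run_queries` and the table names are hypothetical, and the sketch assumes the simplest case, in which every query returns the same columns in the same order.

```python
import sqlite3


def run_queries(connection, queries):
    """Execute each SELECT query and stack the fetched rows.

    Hypothetical helper: column names are taken from the first query,
    and all queries are assumed to return the same columns.
    """
    columns, rows = None, []
    for query in queries:
        cursor = connection.execute(query)
        if columns is None:
            columns = [description[0] for description in cursor.description]
        rows.extend(cursor.fetchall())
    return columns, rows


# Illustrative data: two yearly tables with the same structure.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders_2023 (id INTEGER, amount REAL)")
connection.execute("CREATE TABLE orders_2024 (id INTEGER, amount REAL)")
connection.execute("INSERT INTO orders_2023 VALUES (1, 10.0)")
connection.execute("INSERT INTO orders_2024 VALUES (2, 25.5)")

columns, rows = run_queries(
    connection,
    [
        "SELECT id, amount FROM orders_2023",
        "SELECT id, amount FROM orders_2024",
    ],
)
```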
Tip
SQL query code boxes provide smart syntax highlighting thanks to the Monaco editor engine, the same engine used by widely adopted programming IDEs.
Table Options
The Table Options pane allows users to customize the import options. It is divided into two panes:
Concatenation options, where users can find the import options.
Customization pane
This pane allows full customization when performing an import operation. Users can set the following options:
Bulk size for prefetch: it establishes the number of rows fetched together from the database. A high number improves import performance, but requires more RAM and CPU resources on the database server.
Case sensitive: it controls whether the final dataset should be case sensitive.
Strip spaces: it controls whether all spaces located at the beginning or at the end of any string entry are removed in the final dataset. It also strips any character that cannot be encoded in Unicode format.
Compress white spaces: it controls whether all consecutive spaces located in the middle of any string entry are reduced to a single space in the final dataset.
Turn off smart recognition: it controls whether text columns in the final dataset are cast to a different type according to the values they contain.
The Use old computation data if the source file is not available checkbox allows users to reuse data from the previous computation if the whole task returns an error. In that case, the error is displayed as a warning.
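The Bulk size for prefetch, Strip spaces and Compress white spaces options can be pictured with the Python sketch below, which uses an in-memory SQLite database. It is only an approximation under stated assumptions: `fetch_in_bulks` and `clean_text` are hypothetical names, only the plain space character is handled, and the removal of characters that cannot be encoded is not modeled.

```python
import re
import sqlite3


def fetch_in_bulks(cursor, bulk_size):
    """Yield rows from `cursor`, fetching `bulk_size` rows at a time."""
    while True:
        bulk = cursor.fetchmany(bulk_size)
        if not bulk:
            break
        yield from bulk


def clean_text(value, strip_spaces=True, compress_spaces=True):
    """Apply the two whitespace options to one entry (non-strings pass through)."""
    if not isinstance(value, str):
        return value
    if strip_spaces:
        # Remove spaces at the beginning and at the end of the entry.
        value = value.strip(" ")
    if compress_spaces:
        # Reduce runs of spaces between two non-space characters to one space.
        value = re.sub(r"(?<=\S) {2,}(?=\S)", " ", value)
    return value


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE customers (name TEXT)")
connection.executemany(
    "INSERT INTO customers VALUES (?)",
    [("  Ann   Lee ",), ("Bob",), (" Eve",)],
)
cursor = connection.execute("SELECT name FROM customers ORDER BY rowid")
names = [clean_text(row[0]) for row in fetch_in_bulks(cursor, bulk_size=2)]
```

A larger `bulk_size` means fewer round trips to the database at the cost of more memory held per fetch, which mirrors the trade-off described for the prefetch option.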
Concatenation options
This pane allows users to choose the concatenation type when multiple tables or queries are imported, and to handle the case of missing data in the table.
Concatenation of multiple tables or queries can be configured by customizing the following options:
Concatenation type: users can choose between outer or inner concatenation.
Inner concatenation: the concatenated dataset contains only attributes (i.e. columns) that are present in both datasets.
Outer concatenation: the merged dataset contains all attributes from both datasets. Where data are missing, an empty cell will be displayed.
Match columns by: it defines the way users want to match columns, according to the Name or Position.
The Repeat query execution option allows users to repeat the import every X seconds (set in a dedicated field) until the final dataset contains a precise number of rows (set in the last dedicated number field).
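The Repeat query execution option can be sketched in Python as a simple retry loop. This is an illustration only: the function name `repeat_until_row_count` is hypothetical, the "precise number of rows" is interpreted here as an exact match, and the `max_attempts` limit is an assumption added so that the sketch always terminates.

```python
import sqlite3
import time


def repeat_until_row_count(connection, query, expected_rows, every_seconds, max_attempts=10):
    """Re-run `query` every `every_seconds` until it returns `expected_rows` rows.

    Hypothetical helper: raises TimeoutError after `max_attempts` tries,
    a safety limit that is an assumption of this sketch.
    """
    for attempt in range(max_attempts):
        rows = connection.execute(query).fetchall()
        if len(rows) == expected_rows:
            return rows
        if attempt < max_attempts - 1:
            time.sleep(every_seconds)
    raise TimeoutError(f"Query did not return {expected_rows} rows after {max_attempts} attempts")
```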
Table Preview Panel
Within this pane, users can view a preview of their imported tables. The Number of records in preview field allows users to decide how many rows are displayed in the table preview.
Note
The table preview is not automatic. To load it, click on the LOAD PREVIEW button located at the center of the Table preview area.
Import from Rulex Platform
Rulex Platform is built to fully analyze and manage data. It stores data and models in different forms known as data structures, each carrying information about a specific type of data or ML model. Importing these data structures can be useful for tasks not directly related to the source of this information. Moreover, this mechanism enables seamless sharing of data information between different Platform components or primary resources.
Rulex Factory allows this operation through two different tasks:
the Import from Task task, which gathers some data structures from a target task and imports them into the selected task, optionally converting one of them into the output dataset.
the Import from View task, which imports a table from Rulex Studio.
For more detailed and in-depth information, please refer to the dedicated pages: Import From Task and Import From View.
Create a Dataset from Scratch
In some scenarios, it may be necessary to have a dataset filled through an automatic process, rather than importing one from a file or database table. To perform this operation, the Empty Source task allows users to create their own dataset from scratch by specifying the number of rows and columns.
For more detailed and in-depth information, please refer to the dedicated page Empty Source.
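Conceptually, a dataset created from scratch is an empty table with a given shape. The Python sketch below shows one way to picture it; the function name `empty_dataset`, the generated column names and the use of `None` for empty cells are all assumptions for illustration and do not describe the actual output of the Empty Source task.

```python
def empty_dataset(n_rows, n_columns):
    """Build an empty table with `n_rows` rows and `n_columns` columns.

    Hypothetical helper: column names are placeholders and every cell
    starts as None, ready to be filled by a later automatic process.
    """
    columns = [f"column_{index + 1}" for index in range(n_columns)]
    rows = [[None] * n_columns for _ in range(n_rows)]
    return columns, rows
```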
See also
In the Import task family, users will also find the Rulex Flow File Source, a module whose characteristics are different from the other import modules. It belongs to this family because, like most import tasks, it does not receive any input. For more information about this task, please refer to the Module overview page or to its dedicated page.