Import from Parquet File#

The Import from Parquet File task allows users to import data stored in a Parquet file, whether they have a table layout or not.

The Import from Parquet File task is divided into two tabs:


The Options tab#

The Options tab follows the structure shown in the Import Overview page.


The Parquet Configuration tab#

The Parquet Configuration tab is divided into three panes:

Parsing options

Within this pane, users can set and configure the following options:

  • Missing string: users can enter the string they want to remove from the dataset.

  • Text delimiter: users can select ‘ or ” if these symbols have been used as string delimiters. They will not be included in the imported file. For example, the string “apartment” will be imported as apartment. This option will remove all instances of text delimiters in the string, and not only the initial and closing symbols. The only exception to this rule will be if the symbol is preceded by a backslash. For example, “ad"cb” will be imported as ad”cb, while “ad”cb” will be imported as adcb. The data type for values with string delimiters is nominal, and this data type will not be altered by the removal of text delimiters. For example, “3” will be imported as 3, but will remain a nominal value, instead of being converted to an integer.

Import options

Within this pane, users can set and configure the following options:

  • Start importing from line: the number of the line from which the importing operations will start.

  • Stop importing at line: the number of the line where the importing operations will end. Leave the value 0 if you want the whole dataset to be imported.

  • Column to be imported (empty for all): the number of columns to be imported. If left empty, all the columns will be imported.

  • Remove empty rows: if selected, it removes the empty rows from the imported dataset.

  • Add an attribute containing: if you select one of the available options, you’ll add an extra column to the dataset containing the name of the file, or row group index or both. Available options are:

    • Filename

    • Row group index

    • Both

  • Remove empty columns: if selected, it removes the empty columns from the imported dataset.

  • Case sensitive: if selected, upper cases are considered different from lower cases.

  • Strip spaces: if selected, it removes leading and trailing spaces from the string. For example, the string ” class ” will be imported as “class”.

  • Turn off smart type recognition: if selected, prevents automatic recognition of data types. This option is useful when manual identification is preferable, for example when there is the risk of a code being misinterpreted as a date.

  • Compress white spaces: if selected, it compresses contiguous occurrences of white spaces in one single occurrence. For example the string “university program” would be imported as “university program”.


Example#

  • Drag and drop an Import from Parquet File task onto the stage.

  • Double-click to open the task.

  • Select Custom from the Source slider.

  • Select Local File System from the drop-down menu.

  • Click Select and browse to the Parquet file you want to import.

  • Configure the selected task as explained above in the sections above.

  • According to the selected Parquet file, your Import from Parquet File task should look like the example provided below.

https://cdn.rulex.ai/docs/Factory/import-parquet.webp