Check Dataset
The Check Dataset task compares two datasets and produces a dataset listing all the differences found between them. More information about the output can be found in the output dataset section.
The task layout is very simple and is made of only one tab, the Options tab.
Important
In this task, the old dataset is the first dataset linked to the task, and the new dataset is the second one.
The Options tab
The Options tab contains all the settings available to customize the dataset comparison.
The following options can be found:
Perform only general comparison: if selected, only a general comparison between the columns of the two datasets is performed. If this option is selected, the other ones are greyed out.
Ignore missing values in comparison: if selected, missing values contained in either of the two datasets are not considered when retrieving the differences.
Consider column order in comparison: if selected, the columns with the same name located in different positions in the two datasets are considered as different columns.
Threshold value for doing row comparison: sets the threshold used to decide whether two rows are considered similar in the comparison. It indicates the diversity percentage between the rows in the old dataset and in the new dataset. Its value must be between 0 and 1, where 1 indicates 100%.
The higher the value, the higher the probability that the output dataset contains, in the Change column, modifications rather than additions/deletions.
The lower the value, the higher the probability that the output dataset contains, in the Change column, additions/deletions rather than modifications.
Number of decimals for threshold value: the number of decimals considered for the threshold value.
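The effect of the threshold can be illustrated with a minimal sketch. It is not the task's actual algorithm, which this page does not describe: it assumes that the diversity between two rows is the fraction of shared columns whose values differ, and that a pair is reported as a modification when its diversity does not exceed the threshold. The function names `row_diversity` and `classify_pair`, and the returned labels, are hypothetical.

```python
def row_diversity(old_row, new_row, ignore_missing=False):
    """Fraction of shared columns whose values differ (0.0 = identical rows).

    Assumption: rows are dicts mapping column name -> value, and a missing
    value is represented by None.
    """
    columns = [c for c in old_row if c in new_row]
    if ignore_missing:
        # Skip columns where either side holds a missing value.
        columns = [c for c in columns
                   if old_row[c] is not None and new_row[c] is not None]
    if not columns:
        return 0.0
    differing = sum(1 for c in columns if old_row[c] != new_row[c])
    return differing / len(columns)


def classify_pair(old_row, new_row, threshold, decimals=2, ignore_missing=False):
    """Label a candidate pair of rows according to the threshold."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    # Only the requested number of decimals of the threshold is considered.
    threshold = round(threshold, decimals)
    diversity = row_diversity(old_row, new_row, ignore_missing)
    if diversity == 0:
        return "unchanged"
    if diversity <= threshold:
        # Rows are similar enough: report a modification.
        return "modification"
    # Rows are too different: treat them as a deletion plus an addition.
    return "addition/deletion"


old = {"A": 1, "B": 2, "C": 3, "D": 4}
new = {"A": 1, "B": 2, "C": 3, "D": 5}
print(classify_pair(old, new, threshold=0.5))
print(classify_pair(old, new, threshold=0.1))
```

Under these assumptions the two rows differ in one column out of four, so a high threshold pairs them as a modified row, while a low threshold splits them into a deleted row and an added row.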
The output dataset
Once computed, the task produces a dataset listing all the differences found between the two datasets.
It can be viewed either by right-clicking the task on the stage and selecting Take a look, or by linking a Data Manager task to the Check Dataset task.
The output has a defined structure, made up of the following attributes:
Change: the description of the change that occurred between the old and the new dataset.
Old Row: the number of the row in the old dataset to which the change refers; the attribute involved is indicated in the Column attribute. This column is left empty if the Perform only general comparison option has been selected.
New Row: the number of the row in the new dataset to which the change refers; the attribute involved is indicated in the Column attribute. This column is left empty if the Perform only general comparison option has been selected.
Column: the name of the attribute which has changed from the old dataset to the new one.
Old Value: the value in the old dataset, before the change.
New Value: the value in the new dataset, after the change.
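For readers who process the output outside the tool, the structure above could be modelled as a simple record. This is only an illustrative sketch: the `Difference` type and its field names are hypothetical and are not part of the product; empty cells are assumed to map to `None`.

```python
from typing import NamedTuple, Optional


class Difference(NamedTuple):
    """One row of the output dataset (hypothetical model)."""
    change: str                       # Change
    old_row: Optional[int] = None     # Old Row (empty in a general comparison)
    new_row: Optional[int] = None     # New Row (empty in a general comparison)
    column: Optional[str] = None      # Column
    old_value: Optional[object] = None  # Old Value
    new_value: Optional[object] = None  # New Value


# Two rows as they appear in the general comparison of the example on this page.
examples = [
    Difference(change="Number of Rows", old_value=19, new_value=20),
    Difference(change="Deleted", column="REMOVED"),
]

for diff in examples:
    print(diff)
```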
Example
Import the two datasets into the flow.
- Add a Check Dataset task to the flow and link both the import tasks to it:
The task located at the top of the stage has been connected first, so it is considered the Old dataset.
The task located below it has been connected last, so it is considered the New dataset.
Double-click on the Check Dataset task to open it and check the Perform only general comparison option.
Save and compute the task.
- Then, right-click on the Check Dataset task and select Take a look. The results can be read as follows:
The Number of Rows (in the Change column) has changed from 19 (Old Value) to 20 (New Value),
The Number of Columns (in the Change column) has changed from 5 (Old Value) to 7 (New Value),
The REMOVED attribute (in the Column column) has been Deleted (as found in the Change column),
and so on for the other rows.
- If a more detailed comparison is required, double-click the Check Dataset task again and configure it as follows:
Uncheck the Perform only general comparison option.
Check the Ignore missing values in comparison option.
Leave the other options as default.
Save and compute the task.
- Right-click on the task and select Take a look. The output now contains more rows, and the Old Row and New Row columns are filled in. Results can be read as follows:
Row number 19 in the old dataset (Old Row) has been moved (Change column) to row 20 in the new dataset (New Row),
In the attribute Var_3 (in the Column column), the position of row 7 hasn’t changed (Old Row and New Row are both 7), but its value has changed from VALUE2 in the old dataset (Old Value column) to VALUE1 in the new dataset (New Value column),
and so on for the other rows.
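The general comparison performed in the first part of the example can be approximated with a short sketch. It is an assumption-based illustration, not the task's implementation: the function `general_comparison` is hypothetical, the "Added" label is assumed by analogy with the "Deleted" label shown above, and every column name except Var_3 and REMOVED is invented for the illustration.

```python
def general_comparison(old_columns, old_row_count, new_columns, new_row_count):
    """Return output-like rows describing differences at dataset level only."""
    differences = []
    if old_row_count != new_row_count:
        differences.append({"Change": "Number of Rows",
                            "Old Value": old_row_count,
                            "New Value": new_row_count})
    if len(old_columns) != len(new_columns):
        differences.append({"Change": "Number of Columns",
                            "Old Value": len(old_columns),
                            "New Value": len(new_columns)})
    # Columns present only in the old dataset have been deleted.
    for name in old_columns:
        if name not in new_columns:
            differences.append({"Change": "Deleted", "Column": name})
    # Columns present only in the new dataset have been added (assumed label).
    for name in new_columns:
        if name not in old_columns:
            differences.append({"Change": "Added", "Column": name})
    return differences


# Hypothetical column names chosen to match the counts in the example (5 and 7).
old_cols = ["ID", "Var_1", "Var_2", "Var_3", "REMOVED"]
new_cols = ["ID", "Var_1", "Var_2", "Var_3", "NEW_A", "NEW_B", "NEW_C"]

for row in general_comparison(old_cols, 19, new_cols, 20):
    print(row)
```

As in a general comparison, no row numbers are reported here, which mirrors the empty Old Row and New Row columns described in the output dataset section.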