Data Functions

Data functions operate on the data types or values.

The following functions are available:


cast

The cast function casts the type of the selected attribute to the specified data type.

This function can also be used when the attribute is empty, to ensure the correct data type is assigned to the column, as an empty column will be otherwise be assigned the nominal data type.

Parameters

cast(column, newtype, forced)

Parameter

Description

column

The attribute whose values we want to cast. The column parameter is mandatory.

newtype

The datatype the attribute will be cast to. The newtype parameter is mandatory.Possible values, for this parameter are: “integer”, “nominal”, “continuous”, “long”, “double”, “binary”, “ordinal”, “percentage”, “currency”, “date”, “week”, “month”, “quarter”, “datetime”, “time”, “index”, “boolean”, “int”, “float”, “short”, “string”.

forced

If set to False, any cast operations will would result in the loss of data precision (e.g. converting a continuous value to an integer) will not be performed, and an error message will be displayed. This parameter is set to True by default.

Example - cast(column, newtype, forced)

The following example uses the Car Sales dataset.

  • In the car_sales dataset, the Sales_in_thousands attribute contains continuous values with detailed numbers.

  • If we cast this continuous attribute to an integer we will round off the number into the nearest a thousand.

  • The required formula is cast($"Sales_in_thousands", "integer").

  • If we entered the value False for the forced parameter, the cast operation would not be performed, as it results in a loss of data precision of the continuous attribute values.

  • The formula used in this case is cast($"Sales_in_thousands", "integer", False). An error message will be displayed at the bottom of the screen.


catNames

The catNames function looks for values in specific attributes, and returns the headers of the attributes where the values were found. All the corresponding headers are concatenated.

This function is useful for finding values in very large datasets.

Parameters

catNames(indatt, values, separator, negate)

Parameter

Description

indatt

The attributes whose values we want to search in. Multiple attributes must be included in square brackets, such as [$"hours-per-week","education-num"]. The indatt parameter is mandatory.The dollar sign is not required before the attribute names.

values

The values which will be searched in the attributes indicated in the indatt parameter. The newtype parameter is mandatory.

separator

The required separator to divide the returned headers. Symbols or strings can be used, for example separator = "and".The default value is “-“.

negate

If set to True, instead of returning the headers of attributes where the value is present, the function will return the headers of attributes where the specified value is not present. The default value is False.

Example - catNames(indatt, values)

The following example uses the Adult dataset.

  • In the adult dataset, we are looking for the value “13” in the two attributes hours-per-week and education-num. We have added a new attribute, called Concatenate, to contain the results.

  • The formula will consequently be: catNames(["hours-per-week","education-num"],13).

Example - catNames(indatt, values, separator)

The following example uses the Adult dataset.

  • In the second example, we will then change the default “-” symbol to concatenate values to “,”.

  • The formula will now be : catNames(["hours-per-week","education-num"],13, separator=",")

Example - catNames(indatt, values, separator, negate)

The following example uses the Adult dataset.

  • In this last example, we want to return the headers of the attributes where the value 13 is not present in any of them.

  • The formula will now be: catNames(["hours-per-week","education-num"],13, separator=",", negate = true)


decideType

Assigns the correct data type to an attribute, depending on the values it contains.

Parameters

decideType(column)

Parameter

Description

column

The attribute for which we want to define the correct data type. The column parameter is mandatory.

Example - decideType(column)

The following example uses the Age_BMI dataset.

  • In the Age_BMI dataset, we have changed the data type of the Age attribute to nominal, for the sake of the exercise.

  • We can now use the decideType function to define the correct data type for the attribute, according to the values it contains.

  • The function is simply decideType($"Age").

  • The integer data type has been correctly redefined.


disc

The disc function discretizes data values into ranges defined by cutoff values.

Parameters

disc(column, cutoffs, rank)

Parameter

Description

column

The attribute whose values we want to discretize. The column parameter is mandatory.

cutoffs

The cutoff values that will be used to discretize values into ranges. All cutoff points must be enclosed in square brackets. The cutoffs parameter is mandatory.

rank

By default, the central value of each range is displayed. If instead we want to display a ranking number for each range, the rank value must be set to True. It is False by default.

Example - disc(column, cutoffs)

The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.

  • In the Age_BMI dataset, we have added a new attribute, called Disc_BMI, to the dataset to contain the discretized values of the BMI attribute.

  • In this new attribute we have divided the BMI values into 4 different ranges, using the cutoff values 18, 25 and 31, with the formula disc($"BMI", [18, 25, 31]).

  • The resulting 4 groups display the central value for each range.

Example - disc(column, cutoffs, rank)

The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.

  • If we want to display a ranking number for each range, we simply set the rank parameter to True, instead of leaving its default value.

  • The formula will consequently be: disc($"BMI", [18, 25. 31], True).


discEqualFrequency

The discEqualFrequency function discretizes data values into bins which contain the same number of values in each.

Parameters

discEqualFrequency(column, nvalue, rank, quantile)

Parameter

Description

column

The attribute whose values will be discretized. The column parameter is mandatory.

nvalue

The number of bins you want to create. Cutoff values will be automatically created to discretize values into ranges. All cutoff points must be enclosed in square brackets. The nvalue parameter is mandatory.

rank

By default, the central value of each range is displayed. If instead we want to display a ranking number for each range, the rank value must be set to True. It is False by default.

quantile

If set to True, the values will be discretized in quantiles. By default, this parameter is set to False. This option has a particular impact when there are many identical values in a dataset. Standard discretization will put identical values in a single bin, and perform discretization on the remaining values, while if the quantile parameter is used, multiple bins may have the same central value, if this value is found in a high percentage of rows in the dataset.

Example - discEqualFrequency(column, nvalue)

The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.

  • In the Age_BMI dataset, we have added a new attribute, called Disc_Age, to the dataset to contain the discretized values of the Age attribute.

  • In this new attribute we have divided the Age values into 5 different bins, which each contain the same number of values, with the formula discEqualFrequency($"Age", 5).

  • The resulting 5 groups display the central value for each bin: 35, 40, 46, 52, 58.

Example - discEqualFrequency(column, nvalue, nrank)

The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.

  • If we want to display a ranking number for each range, we simply set the rank parameter to True, instead of leaving its default value.

  • The formula will consequently be: discEqualFrequency($"Age", 5, True).

  • The resulting 5 bins will now have a ranking number from 1 to 5.


discEqualWidth

The discEqualWidth function discretizes data values into bins of equal width.

Parameters

discEqualWidth(column, nvalue, rank, min, max)

Parameter

Description

column

The attribute whose values we want to discretize. The column parameter is mandatory.

nvalue

The number of bins you want to create. Cutoff values will be automatically created to discretize values into ranges. All cutoff points must be enclosed in square brackets. The nvalue parameter is mandatory.

rank

By default, the central value of each range is displayed. If instead we want to display a ranking number for each range, the rank value must be set to True. It is False by default.

min

If required, you can set a minimum value, which may not correspond to the values currently present in the dataset. For example, in our dataset the minimum value for the age attribute is currently 32, but we would like to create bins that range from 18 years of age, as this is the potential age range of our survey. In this case, you can set the minimum value to 18.

max

If required, you can set a maximum value, which may not correspond to the values currently present in the dataset. For example, in our dataset the maximum value for the age attribute is currently 61, but we would like to create bins that range up to 80, as this is the potential age range of our survey. In this case, you can set the maximum value to 75.

Example - discEqualWidth(column, nvalue)

The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.

  • In the Age_BMI dataset, we have added a new attribute, called Age_Disc, to the dataset to contain the discretized values of the Age attribute.

  • In this new attribute we have divided the Age values into 5 different bins of equal width, with the formula discEqualWidth($"Age", 5).

  • The resulting 5 groups display the central value for each bin: 35, 41, 47, 52, 58.

Example - discEqualWidth(column, nvalue, rank)

The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.

  • If we want to display a ranking number for each range, we simply set the rank parameter to True, instead of leaving its default value.

  • The formula will consequently be: discEqualWidth($"Age", 5, True).

  • The resulting 5 bins will now have a ranking number from 1 to 5.

Example - discEqualWidth(column, nvalue, rank, min, max)

The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.

  • We can change the range of values considered when defining bins, by setting values for the minimum and maximum values.

  • For example, in our dataset the range of values for our Age attribute is from 32 to 61 years, but we would like to create bins that range from 18 years up to 75, as this is the potential age range of our survey.

  • In this case, you can set the min parameter to 18, and the max to 75. The formula in this case would be discEqualWidth($"Age", 5, False, 18, 75). If you want to leave the default value for the rank parameter, and just set the last two min and max parameters, you do not need to specify a rank value, but by doing this you are no longer able to identify parameters by their position, otherwise 18 would appear to be the value for the rank. To get around this, you simply need to add the name of the parameter you are defining to the function, after the skipped parameter: discEqualWidth($"Age", 5, min=18, max=75).

  • In our example, 5 bins have been created, but there are only values currently available for 3 of the 5 groups (no values in the lowest and highest value bins, for example from 18 to 30, and from 65 to 75).


discretize

The discretize function discretizes values of a selected attribute into bins of equal width, or with the same number of values or according to cutoff values.

See also

For more details on each specific discretization method, check out the specific topics:

Parameters

discretize(column, nvalue, cutoffs, mode, rank, quantile, min, max)

Parameter

Description

column

The attribute whose values we want to discretize. The column parameter is mandatory.

nvalue

The number of bins you want to create. This value can be any number, up to the number of rows in the dataset. The nvalue parameter is mandatory if the mode parameter is ef (Equal Frequency) or ew (Equal Width).

cutoffs

The cutoff values that will be used to discretize values into ranges. All cutoff points must be enclosed in square brackets.

mode

The method you want to use to discretize values. Possible values are:

  • “ef” for Equal Frequency (default value)
  • “ew” for Equal Width.

rank

By default, the central value of each range is displayed. If instead we want to display a ranking number for each range, the rank value must be set to True. It is False by default.

quantile

If set to True, the values will be discretized in quantiles. By default, this parameter is set to False. This option has a particular impact when there are many identical values in a dataset. Standard discretization will put identical values in a single bin, and perform discretization on the remaining values, while if the quantile parameter is used, multiple bins may have the same central value, if this value is found in a high percentage of rows in the dataset.

min

If required, you can set a minimum value, which may not correspond to the values currently present in the dataset. For example, in our dataset the minimum value for the age attribute is currently 32, but we would like to create bins that range from 18 years of age, as this is the potential age range of our survey. In this case, you can set the minimum value to 18.

max

If required, you can set a maximum value, which may not correspond to the values currently present in the dataset. For example, in our dataset the maximum value for the age attribute is currently 61, but we would like to create bins that range up to 80, as this is the potential age range of our survey. In this case, you can set the maximum value to 75.

Example - discretize(column, nvalue, mode=mode)

The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.

Note

Given the high number of parameters in the task, it is important to use keywords to identify the parameters we are providing values for (e.g. mode=”ef”), so we do not need to include them all in every function.

  • In the Age_BMI dataset, we have added a new attribute, called Age_Disc, to the dataset to contain the discretized values of the Age attribute.

  • In this new attribute we have divided the Age values into 5 different bins of equal width, with the formula discretize($"Age", 5, mode="ef").

  • The resulting 5 bins display the central value for each bin: 35, 40, 46, 52, 58.

Example - discretize(column, nvalue, mode=mode)

The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.

  • In this example, we will create another attribute, called BMI_Disc, which we can use to display the results of the discretized values of the BMI attribute.

  • This time we will use the Equal Width discretization method to create 4 bins with the same range of values.

  • The formula will consequently be: discretize($"BMI", 4, mode="ew").

  • The resulting 4 bins will display the central values for each bin: 19, 24, 30, 35.

Example - discretize(column, nvalue, cutoffs)

The following example uses the Age_BMI dataset. This dataset has been extracted from the public Hepatitis C Virus (HCV) for Egyptian patients dataset available on Kaggle.

  • In the final example, we will again divide the BMI attribute into 4 bins, using the Disc_BMI attribute to display the results, but this time using cutoff values that we will insert into the formula.

  • To create 4 bins we must enter three cut-off points, which in our example are 20, 25 and 30.

  • The formula will consequently be: discretize($"BMI", 4, [20,25,30]).

  • The results display the central value for each range created using the cut-off values we supplied.


isAttribute

The isAttribute function checks whether a specified attribute is present in the dataset. The result can either be returned as a Boolean (True / False) or binary (0/1) result.

This function is useful for large datasets.

Parameters

isAttribute(name, binary)

Parameter

Description

name

The name of the attribute we are looking for. The name parameter is mandatory.

binary

If set to True, or simply not specified, results will be provided as a Boolean (True / False), while if set to False, results will be provided in binary form (1/0).

Example - isAttribute(name)

The following example uses the Adult dataset.

  • In the adult dataset, we have added a new attribute, called Present_in_dataset, where the results of the isAttribute function will be saved.

  • We will then check to see whether there is an attribute called education in the dataset, with the formula isAttribute("education").

  • As we have not specified a value for the binary parameter, the results will be expressed as a Boolean value: True in this specific case.

Example - isAttribute(name, binary)

The following example uses the Adult dataset.

  • If we wanted to receive the result in boolean 0/1 value form instead of a True / False, we would need to specify that the binary parameter should be False, with the following formula: isAttribute("education", False).


isFloat

The isFloat function checks whether the data type of the attribute specified in the string parameter is continuous.

Parameters

isFloat(string, binary)

Parameter

Description

string

The nominal attribute to be tested. The string parameter is mandatory.

binary

If it is True, or not specified, results will be provided as boolean values (True / False), while if it is False, results will be provided in binary form (1 / 0). The binary parameter values (True / False) are case sensitive.

Example - isFloat(string, binary)

The following example uses the Sales Videogames dataset.

  • In this example, we want to check if the attribute LowUserPrice is a Float attribute.

  • We add a new attribute and write the following formula:isFloat($"LowUserPrice",False).

  • We specified False as binary parameter, so the results will be provided as 1 (True) or 0 (False).


isType

The isType function checks whether the data type of the attribute specified in the string parameter corresponds to the data type specified in the type parameter.

Parameters

isType(string, type, binary)

Parameter

Description

string

The nominal attribute to be tested. The string parameter is mandatory.

type

The attribute type we want to verify on the string. The type parameter is mandatory.

binary

If the optional binary parameter is True, or not specified, results will be provided as boolean values (True / False), while if it is False, results will be provided in binary form (1 / 0). The binary parameter values (True / False) are case sensitive.

Note

Attribute types you can verify in this function:
  • nominal

  • integer

  • continuous

  • binary

  • percentage

  • currency

  • date

  • datetime

  • month

  • week

  • quarter

  • time

Example - isType(string, type)

The following example uses the IsFunctions dataset.

  • In this example, we want to check if the Month attribute is an integer attribute.

  • Add a new attribute where we can use the isType function.

  • The function has returned False, because the attribute Month hasn’t got numeral values.


type

The type function returns the data type of the selected column as a nominal value.

Parameters

type(column)

Parameter

Description

column

The attribute whose data type we want to return. The column parameter is mandatory.

Example - type(string, type)

The following example uses the Adult dataset.

  • In the adult dataset, we have added a new type attribute, where the data type of the education-num attribute has been retrieved, using the formula type($"education_num").

  • The data type has been correctly returned as integer (expressed as a nominal attribute as “integer” is a string value).