Similar Items Detector

Rulex Factory generates description-based and sales-based replacement rules with the Similar Items Detector task.

Description-based matching can be applied to newly introduced items, which helps solve cold start problems.

Warning

This task must be linked to a Frequent Itemsets Mining task, which provides the input data for the analysis.

The task is divided into three tabs:

  • the Options tab, where you can configure the task using a Text based matching or Sales based matching method.

  • the Replacement rules tab, where you can visualize the generated replacement rules and their details.

  • the Results tab, where information about the computation is displayed.


The Options tab

The Options tab contains all the task’s features, which can be customized to obtain the desired output.

It is divided into two smaller tabs: the Text based matching and the Sales based matching tabs.

The Available attributes list is always available, no matter which tab is opened.

The Text based matching tab

This tab lets you configure the features of the description-based replacement rules. It is divided into two areas: the attribute drop area and the options list.

In the attribute drop area, the following panes are provided:

  • Item key attributes (NOMINAL): drag the nominal attributes that uniquely identify the item from the Attributes list. Instead of being dragged and dropped manually, attributes can also be defined via a filtered list.

  • Preferential requirements attributes (NOMINAL): drag the attributes that influence the similarity score. When the values of such an attribute match for two items, a weight is added to the similarity score. This weight is defined in the Preferential requirements weights.
    These attributes could, for example, define brand, packaging or size.
    Instead of being dragged and dropped manually, attributes can also be defined via a filtered list.
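
The effect of the Preferential requirements attributes can be pictured with a minimal Python sketch. It is not the Rulex implementation: the function name `adjusted_score`, the dictionary layout and the weight values are all hypothetical, and it only assumes what is stated above, namely that a weight is added to the similarity score for each preferential attribute whose values match.

```python
def adjusted_score(cosine, item_a, item_b, weights):
    """Add the weight of every preferential attribute whose values match.

    cosine  -- unadjusted text similarity of the two descriptions
    weights -- hypothetical mapping: attribute name -> weight
    """
    bonus = sum(
        weight
        for attribute, weight in weights.items()
        if attribute in item_a and item_a.get(attribute) == item_b.get(attribute)
    )
    return cosine + bonus


# Illustrative items and weights (assumed values, not taken from the product).
item_a = {"brand": "Acme", "packaging": "bottle", "size": "1L"}
item_b = {"brand": "Acme", "packaging": "can", "size": "1L"}
weights = {"brand": 0.2, "packaging": 0.1, "size": 0.05}

# Brand and size match, packaging does not.
score = adjusted_score(0.6, item_a, item_b, weights)
```

In this example only the brand and size weights contribute, so the packaging mismatch leaves the text similarity unchanged rather than penalizing it.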

In the options list, the following features are provided:

  • Category attribute: the attribute that represents the category. This can be used to match only descriptions that belong to the same category. To learn more about the attribute drop-down menu, see the corresponding page.

  • Description attribute: the attribute that represents the description, which will be used for text matching. To learn more about the attribute drop-down menu, see the corresponding page.

  • Word separator: select how words are separated from one of the following possibilities:
    • Space

    • Tab

    • Newline

  • Minimum word length (0: no minimum length): words that are shorter than the value entered here will not be used for text matching. This helps to eliminate words such as ‘the’, ‘a’, ‘one’, ‘at’ etc.

  • Minimum unadjusted similarity cosine (0-1 range): the minimum similarity of pure text matching, without considering Preferential requirements attributes.
    Entering 1 means the text must be identical, while 0 means no match is required.

  • Case sensitive matching: if selected, upper and lower case letters are treated as different when matching text.

On the right, another panel is provided, containing:

  • Ignored char list: select the characters to exclude from text matching. Possible options are:
    • Special characters

    • Single numbers

    • Alphabet letters (both upper case and lower case letters)
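
To make the text matching options more concrete, here is a rough Python sketch of how a word separator, a minimum word length, case sensitivity, ignored characters and a minimum similarity cosine could interact. It is an illustration under assumptions, not the algorithm used by the task: the functions `tokenize` and `cosine_similarity`, their parameters, and the use of plain word counts as vector components are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text, separator=" ", min_word_length=0, case_sensitive=False,
             ignored_chars=r"[^A-Za-z0-9\s]"):
    """Split a description into words, applying the options described above.

    ignored_chars is a regular expression; the default drops special
    characters (an assumed reading of the "Special characters" option).
    """
    if not case_sensitive:
        text = text.lower()
    text = re.sub(ignored_chars, "", text)
    words = [word for word in text.split(separator) if word]
    return [word for word in words if len(word) >= min_word_length]


def cosine_similarity(words_a, words_b):
    """Cosine between two word-count vectors: 1 for identical, 0 for disjoint."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative descriptions (made up for this sketch).
words_a = tokenize("Olive Oil Extra-Virgin 1L", min_word_length=2)
words_b = tokenize("OLIVE OIL 1L", min_word_length=2)
similarity = cosine_similarity(words_a, words_b)

minimum_similarity = 0.8  # plays the role of Minimum unadjusted similarity cosine
is_candidate = similarity >= minimum_similarity
```

With case sensitive matching switched on in this sketch (`case_sensitive=True`), "Olive" and "OLIVE" would count as different words and the similarity of the two descriptions would drop.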

The Sales based matching tab

This tab lets you configure the features of the sales-based replacement rules. The following options are provided:

  • Take also sales data into account: select this option to include sales data in the task execution.

  • Minimum alternativeness coefficient: the minimum required degree of alternativeness between the purchases of two items. The coefficient is:
    • 1 (max) if they are never sold together

    • 0 (min) if one item is always sold with the other one.

If a pair of items does not reach the Minimum alternativeness coefficient, the corresponding replacement rule is discarded. This option is activated only if the Take also sales data into account option is selected.

  • Minimum volume replacement score %: the minimum percentage of orders in which a replaced item is expected to be replaceable by the replacing item. If this minimum threshold is not satisfied by a replacement rule, it is discarded. This option is activated only if the Take also sales data into account option is selected.
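
The exact formula of the alternativeness coefficient is not given here. The Python sketch below uses one hypothetical formula that reproduces the two extremes described above (1 when the items are never sold together, 0 when one item is always sold with the other): one minus the share of co-occurring orders among the orders of the less frequent item. The function name `alternativeness` and the sample orders are assumptions made for illustration only.

```python
def alternativeness(orders, item_a, item_b):
    """Hypothetical coefficient: 1 - |orders with both| / min(|with a|, |with b|).

    Returns None when one of the items never appears in the orders.
    """
    with_a = sum(1 for order in orders if item_a in order)
    with_b = sum(1 for order in orders if item_b in order)
    with_both = sum(1 for order in orders if item_a in order and item_b in order)
    rarest = min(with_a, with_b)
    if rarest == 0:
        return None
    return 1 - with_both / rarest


# Made-up orders, each represented as a set of item IDs.
orders = [
    {"cola", "chips"},
    {"pepsi", "chips", "dip"},
    {"cola"},
    {"pepsi", "bread"},
    {"chips", "dip"},
]

never_together = alternativeness(orders, "cola", "pepsi")   # never in the same order
always_together = alternativeness(orders, "dip", "chips")   # dip only sold with chips
sometimes_together = alternativeness(orders, "cola", "chips")
```

Under this assumed formula, a pair such as cola and pepsi scores high and stays a plausible replacement pair, while dip and chips scores low because the two items are bought together rather than instead of each other.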


The Replacement rules tab

The Replacement rules tab displays the generated replacement rules in a spreadsheet format.

The following attributes can be found:

  • Replacement rule ID: the ID of the replacement rule.

  • Category: the value of the attribute specified in the Category attribute area, assigned to the corresponding replacement rule.

  • Replaced item ID: the ID of the item to be replaced.

  • Replaced item support #: the support of the item to be replaced.

  • Replaced item loneliness %: the loneliness % of the item to be replaced.

  • Replacing item ID: the ID of the replacing item.

  • Replacing item support #: the support of the replacing item.

  • Alternativeness coefficient: the alternativeness coefficient value.

  • Item name: description replaced item: the description of the replaced item, taken from the attribute selected in the Description attribute drop-down list.

  • Item name: description replacing item: the description of the replacing item, taken from the attribute selected in the Description attribute drop-down list.

  • Similarity cosine: the similarity cosine value.

  • Score contribution: the contribution of each attribute in the Preferential requirements attributes area, if some attributes have been dragged there.

  • Volume replacement score %: the percentage of orders in which the replaced item is expected to be replaceable by the replacing item.


The Results tab

The Results tab provides information on the computation. It is divided into two tabs:

  • The General Info tab, where the following information can be found:
    • Task Label: the task’s name.

    • Elapsed time (sec): the time required for the latest computation (in seconds).

  • The Result Quantities tab, which contains the data quantities: check the results to be visualized, then click the arrow button to display their values. The following information is provided:
    • Number of different items in input

    • Number of different orders in input

    • Number of generated frequent itemsets