Frequent Itemsets Mining

Frequent itemset mining extracts frequent item associations from a dataset. Rulex uses the Equivalence Class Transformation (Eclat) algorithm to perform this task.

A typical application is identifying which items are frequently bought together in a supermarket.

The output is a table of itemsets that are bought in the same transaction at least a specified number of times. However, the task can be used in many other scenarios, whenever it is possible to identify attributes that define groups (Order key attributes) and attributes that populate these groups with information (Item key attributes).
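To make the idea concrete, here is a minimal Python sketch of the Eclat approach on a tiny order/item table. It is not Rulex's implementation: the function name eclat, its parameters and the sample data are assumptions made for illustration. Each item is mapped to the set of orders that contain it, and larger itemsets are built by intersecting those sets.

```python
from collections import defaultdict


def eclat(rows, min_support, max_cardinality=None):
    """Return {itemset (frozenset): support count} for all frequent itemsets.

    rows: iterable of (order_id, item) pairs, one pair per purchase.
    min_support: minimum number of orders an itemset must appear in.
    max_cardinality: optional cap on the size of generated itemsets.
    """
    # Vertical layout: each item points to the set of orders containing it.
    order_sets = defaultdict(set)
    for order_id, item in rows:
        order_sets[item].add(order_id)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, orders) pairs already frequent together with prefix.
        for i, (item, orders) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(orders)
            if max_cardinality is not None and len(itemset) >= max_cardinality:
                continue
            # Grow the itemset by intersecting with the remaining candidates.
            next_candidates = []
            for other, other_orders in candidates[i + 1:]:
                common = orders & other_orders
                if len(common) >= min_support:
                    next_candidates.append((other, common))
            if next_candidates:
                extend(itemset, next_candidates)

    start = sorted(
        ((item, orders) for item, orders in order_sets.items()
         if len(orders) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent


# Hypothetical sample data: (order key, item key) pairs.
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "eggs"),
    (3, "milk"), (3, "eggs"),
    (4, "bread"), (4, "milk"),
]
itemsets = eclat(rows, min_support=2)
```

With this sample, bread and milk appear together in three orders, so the pair is reported with support 3, while bread and eggs share only one order and are dropped.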

Rulex can handle both:

  • generalized frequent itemset mining, where the items refer to different attributes and consequently carry different information

  • hierarchical frequent itemset mining, where the attributes carry the same information with different levels of detail.

The task is divided into three tabs:

  • the Options tab, where you can configure the analysis features;

  • the Frequent itemsets tab, where you can visualize the created itemsets in a spreadsheet format;

  • the Results tab, where results on the computation are provided.


The Options tab

The Options tab contains all the task features which can be customized to obtain the desired output.

It is divided into three tabs: the Basic, the Advanced and the Output tabs.

The Available attributes list is always displayed, no matter which Options tab is opened.

Basic tab

The Basic tab contains three panels: the Available attributes list, which contains the dataset’s attributes; the attribute drop area, onto which attributes are dragged to set up the analysis; and a panel with general analysis options.

Drag the attributes required for the analysis onto the attribute drop area. The following areas are provided:

  • the Order key attributes (NOMINAL): the nominal attributes which define orders. Instead of being dragged manually, attributes can be defined via a filtered list.

  • the Item child attributes (NOMINAL): the nominal attributes which characterize items. Instead of being dragged manually, attributes can be defined via a filtered list.

  • the Item parent attributes (NOMINAL): the nominal attributes which correspond to the hierarchically superior level of the attributes inserted in the Item child attributes list. For example, if the analysis involves EAN codes and categories, the EAN code is dragged and dropped onto the Item child attributes list, while the category is inserted in the same position of the Item parent attributes list. If the parent item attribute is not defined for some child instances (e.g. an EAN is not categorized), the child attribute value is repeated in the parent attribute column. This list is enabled only if the Hierarchical item attributes option is selected. Instead of being dragged and dropped manually, attributes can be defined via a filtered list.

The options panel is located under the attribute drop area and contains the following:

  • Minimum item support (# samples): all items which appear in orders fewer times than this threshold are discarded.
    This value is relevant only if the Auto (specify # items) option is unchecked.

  • Auto (specify # items): if selected, the minimum support for items is computed automatically, and the number of items to take into account can be specified in the # Items to consider spin box.

  • # Items to consider: the number of items to take into account (most frequent first).
    This option is enabled only if the Auto (specify # items) option is selected.

  • Keep auto item support threshold also for itemsets: if selected, all itemsets which occur fewer times than the automatically computed item support threshold are discarded.
    This option is enabled only if the Auto (specify # items) option is selected.

  • Minimum itemset support (# samples): all itemsets which occur fewer times than this threshold are discarded. This option is enabled only if the Auto (above average) option is not selected.

  • Auto (above average): if selected, the minimum itemset support value is set to the average support of itemsets with the same cardinality.

  • Maximum itemset cardinality: the maximum cardinality of generated itemsets.

  • No maximum itemset cardinality: if selected, all itemsets with higher support than the specified threshold are generated, regardless of their cardinality.

  • Minimum number of different attributes involved in each itemset: the minimum number of different attributes that an itemset must involve in order to be kept.

  • Hierarchical item attributes: if checked, this option denotes the existence of hierarchical attributes which characterize items, consequently enabling the Item parent attributes drop area.

  • Support count only for top-level attributes: if checked, this option modifies support computation so that only top-level attributes in a hierarchy are taken into account. If this option is not checked, for every order, all included elements of the hierarchy increment their support by 1. This option is enabled only if the Hierarchical item attributes option is selected.
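The item support options above can be sketched in Python as follows. This is only an illustration: how Rulex derives the automatic threshold is not stated here, so the rule used in auto_min_support (take the support of the N-th most frequent item) is an assumption, and all function names and sample values are hypothetical.

```python
from collections import defaultdict


def item_supports(rows):
    """Count, for each item, the number of distinct orders that contain it."""
    orders_per_item = defaultdict(set)
    for order_id, item in rows:
        orders_per_item[item].add(order_id)
    return {item: len(orders) for item, orders in orders_per_item.items()}


def filter_by_min_support(supports, min_support):
    """Discard items that appear in fewer orders than min_support."""
    return {item: s for item, s in supports.items() if s >= min_support}


def auto_min_support(supports, n_items):
    """ASSUMED rule: use the support of the n-th most frequent item as threshold.

    Items tied with the n-th item are also kept, so slightly more than
    n_items may survive the filter.
    """
    ranked = sorted(supports.values(), reverse=True)
    if not ranked:
        return 0
    return ranked[min(n_items, len(ranked)) - 1]


# Hypothetical sample data: (order key, item key) pairs.
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "eggs"),
    (3, "milk"), (3, "eggs"), (3, "jam"),
    (4, "bread"), (4, "milk"),
]
supports = item_supports(rows)
threshold = auto_min_support(supports, n_items=2)
kept = filter_by_min_support(supports, threshold)
```

Here the two most frequent items are milk (4 orders) and bread (3 orders), so the derived threshold is 3 and only those two items are kept.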

Advanced tab

In the Advanced tab, advanced options to customize and complete the analysis are provided. Along with the Available attributes list, you will find the attribute drop area and the attribute filters.

In the attribute drop area, the following panels are provided:

  • Auxiliary attributes (ORDERED, greater values are more relevant): drag the numeric attributes for which a high value is more relevant. Instead of being dragged and dropped manually, attributes can be defined via a filtered list.

  • Auxiliary attributes (ORDERED, lower values are more relevant): drag the numeric attributes for which a low value is more relevant. For example, an attribute containing the number of days that have passed since each transaction, when the most recent transactions are of primary interest. Instead of being dragged and dropped manually, attributes can be defined via a filtered list.

  • Item quantities (INTEGER, range: 1-255): drag the integer attributes whose overall quantity must be calculated onto the Item quantities target list. Instead of being dragged and dropped manually, attributes can be defined via a filtered list.

The attribute filters available are:

  • Attribute to filter to select rows including relevant items: drag the attribute from the Available attributes list to specify a filtering criterion.
    Items satisfying this criterion are considered relevant, regardless of the number of transactions in which they appear (that is, their support).

  • Attribute to filter to discard rows including irrelevant items: drag the attribute from the Available attributes list to specify a filtering criterion.
    Items satisfying this criterion are discarded, regardless of the number of transactions in which they appear (that is, their support). If both the selecting and the discarding filters are specified, the discarding filter prevails.

  • Maximum factor per auxiliary attribute adjusting support: specify the value up to which the support of items and associations may be multiplied or divided, according to the average value of its auxiliary attribute(s).

Output tab

In the Output tab, you can customize the output’s features. The following options are provided:

  • Flag maximal frequent itemsets: if selected, a column is added to the table that specifies whether each frequent itemset is maximal, i.e. whether it is not included in any other frequent itemset.

  • Rare itemsets mining: if selected, the output will display rare itemsets instead of frequent itemsets. Rare itemsets are groupings of items that are rarely found together, although they may be frequent individually.

  • Maximum itemset support: the maximum number of times the items in an itemset can be found together for the itemset to be considered rare. This option is active only if the Rare itemsets mining option is selected.

  • Maximum relative support for itemsets: the relative support compares the number of times an item appears together with the other items in the rare itemset with the number of times it appears without them. This option is active only if the Rare itemsets mining option is selected.
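The notion of a maximal frequent itemset used by the flag above can be illustrated with a short Python sketch. The function name flag_maximal and the sample itemsets are assumptions made for illustration; the sketch simply marks an itemset as maximal when no other frequent itemset in the list strictly contains it.

```python
def flag_maximal(frequent_itemsets):
    """Map each frequent itemset to True if no other frequent itemset contains it."""
    itemsets = [frozenset(s) for s in frequent_itemsets]
    # "s < other" is the proper-subset test on (frozen)sets.
    return {s: not any(s < other for other in itemsets) for s in itemsets}


# Hypothetical frequent itemsets produced by a previous mining step.
frequent = [
    {"bread"}, {"eggs"}, {"milk"},
    {"bread", "milk"}, {"eggs", "milk"},
]
flags = flag_maximal(frequent)
```

In this example the three single items are not maximal, because each is contained in a frequent pair, while both pairs are maximal.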


The Frequent itemsets tab

This tab displays the generated itemsets in a spreadsheet format. The following attributes are provided in the spreadsheet:

  • Frequent Itemset ID: the sequential ID number for frequent itemsets.

  • Cardinality: the cardinality of the frequent itemset.

  • # Different attributes: the number of child attributes used to perform the analysis.

  • Support: the percentage of orders in which the frequent itemset appears in the dataset.

  • Support#: the number of times the frequent itemset appears in the dataset.

  • All-confidence: the ratio between the support of the itemset and the support of the least frequent item included in the itemset.

  • Item ID #: the ID of the items composing the frequent itemset reported in these columns.

  • Maximal frequent itemset?: indicates whether the row represents a maximal frequent itemset.
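As a rough illustration of how the Cardinality, Support, Support# and All-confidence columns relate to each other, the following Python sketch computes them for one itemset. It follows the wording of the list above, where all-confidence divides the itemset support by the support of its least frequent item; some references use the most frequent item in the denominator instead. The function name describe_itemset and the sample orders are assumptions made for illustration.

```python
def describe_itemset(itemset, orders):
    """Compute the descriptive columns for one itemset.

    itemset: set of items.
    orders: dict mapping each order key to the set of items in that order.
    """
    n_orders = len(orders)
    # Support#: number of orders containing every item of the itemset.
    count = sum(1 for items in orders.values() if itemset <= items)
    # Support of the least frequent single item in the itemset.
    least_frequent = min(
        sum(1 for items in orders.values() if item in items)
        for item in itemset
    )
    return {
        "Cardinality": len(itemset),
        "Support": 100.0 * count / n_orders if n_orders else 0.0,
        "Support#": count,
        "All-confidence": count / least_frequent if least_frequent else 0.0,
    }


# Hypothetical sample data: order key -> items in the order.
orders = {
    1: {"bread", "milk"},
    2: {"bread", "milk", "eggs"},
    3: {"milk", "eggs"},
    4: {"bread", "milk"},
}
row = describe_itemset(frozenset({"bread", "milk"}), orders)
```

For this sample, bread and milk occur together in 3 of the 4 orders, giving a Support# of 3 and a Support of 75%.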


The Results tab

The Results tab provides information on the computation. It is divided into two sections:

  • The General Info area contains the following information:
    • Task Label: the task’s name.

    • Elapsed time (sec): the time required for the latest computation in seconds.

  • The Result quantities area contains the data quantities: select the results to view, then click the arrow button to display their values. The following information is provided:
    • Number of different items in input

    • Number of different orders in input

    • Number of generated frequent itemsets

      All the quantities listed above are divided into Frequent, Items and Orders.


Example

  • After importing the dataset (remember to set the get names from line option to 0), link a Data Manager task to the import task and open it.

  • Add a new attribute column to the dataset, called ORDER_ID, then select it and populate it with the values resulting from the enum() function.

  • Then, set its type to nominal. Save and compute the task.

https://cdn.rulex.ai/docs/Factory/fim-example-1.webp
  • The current format of the dataset is not suitable for the Frequent Itemsets Mining task as each row represents a full transaction and not a single purchase. The dataset must be restructured so that the information concerning a purchase of n items is distributed over n rows, each one including an Order ID/Item ID pairing.

  • The dataset can be restructured by adding a Reshape To Long task to the flow.

  • Double-click the Reshape To Long task and drag all the attributes from the left, apart from the ORDER_ID attribute, onto the Attributes to be transformed in long format target list.

  • Save and compute the task.

https://cdn.rulex.ai/docs/Factory/fim-example-2.webp
  • Right-click the Reshape To Long task and select Take a look to check the new structure.

  • The dataset is now structured with a row for every single purchase.

https://cdn.rulex.ai/docs/Factory/fim-example-3.webp
  • Now add a Frequent Itemsets Mining task, and configure the task as follows:
    • Drag the ORDER_ID attribute onto the Order key attributes target list.

    • Drag the Wide_1 attribute onto the Item child attributes target list.

    • Select the Auto (specify # items) checkbox.

    • Set # Items to consider to 50, to evaluate the 50 top-selling items.

    • Set the Maximum itemset cardinality to 3.

    • Save and compute the task.

https://cdn.rulex.ai/docs/Factory/fim-example-4.webp
  • The resulting itemsets are displayed in the Frequent itemsets tab.

https://cdn.rulex.ai/docs/Factory/fim-example-5.webp
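
The restructuring performed by the Reshape To Long step in the example above can be sketched outside Rulex in a few lines of Python. This is not the Rulex task itself: the function name reshape_to_long, the input column names col_1 to col_3 and the choice to skip empty cells are assumptions made for illustration. Only ORDER_ID and Wide_1 are names taken from the example.

```python
def reshape_to_long(wide_rows, id_column="ORDER_ID", item_column="Wide_1"):
    """Turn one row per transaction into one row per purchased item."""
    long_rows = []
    for row in wide_rows:
        for column, value in row.items():
            # Skip the order key itself and empty cells (assumed behavior).
            if column == id_column or value in (None, ""):
                continue
            long_rows.append({id_column: row[id_column], item_column: value})
    return long_rows


# Hypothetical wide dataset: each row is a full transaction.
wide_rows = [
    {"ORDER_ID": "1", "col_1": "bread", "col_2": "milk", "col_3": None},
    {"ORDER_ID": "2", "col_1": "eggs", "col_2": None, "col_3": None},
]
long_rows = reshape_to_long(wide_rows)
```

The first transaction, which holds two items, becomes two Order ID/Item ID rows, and the second becomes one.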