Principal Component Analysis

The Principal Component Analysis task identifies the most important components in a dataset and consequently reduces the number of attributes the dataset contains. These components correspond to linear combinations of attributes that capture most of the variance in the values. Principal Component Analysis essentially compresses a large amount of data into a smaller number of attributes that capture the essence of the original data. To put it simply, think of our TVs, which show us 3D people and places flattened into 2D viewing: although a dimension is missing, we don’t lose much detail.

The first new “attribute” (called an eigenvector) represents the maximum variation in the data, the second eigenvector represents the second largest amount of variation and so on. In the Principal Component Analysis task in Rulex you can select how many eigenvectors you want to create in your new compressed dataset.
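The ordering of eigenvectors by captured variance can be sketched in plain NumPy (an illustrative example only, not Rulex's implementation):

```python
import numpy as np

# Build a small dataset of 200 samples with 3 correlated attributes.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
data = np.hstack([base * 3 + rng.normal(scale=0.5, size=(200, 1)),
                  base * 2 + rng.normal(scale=0.5, size=(200, 1)),
                  rng.normal(scale=0.5, size=(200, 1))])

# Eigen-decomposition of the covariance matrix of the centered data.
centered = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))

# np.linalg.eigh returns eigenvalues in ascending order; reverse so the
# first eigenvector is the one capturing the maximum variation.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keeping only the first two eigenvectors yields the compressed dataset.
compressed = centered @ eigenvectors[:, :2]
```

Here selecting two eigenvectors plays the role of choosing how many eigenvectors to create in the new compressed dataset.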

This function is extremely useful when dealing with datasets that have a very high number of attributes, or when preparing large datasets for tasks such as clustering, neural networks and linear fit. The technique can also help to avoid overfitting in rules: when there are very many attributes, rules can become overly precise and articulated on the training set, and consequently fail to produce good results when applied to new data.

However, eigenvectors do not represent a single aspect of the original dataset, such as age or occupation, but a combination of them. Consequently, the task does not generate immediately understandable, explainable rules. It is possible to analyze how much each original attribute influenced the eigenvectors in the rules, but this method is rather approximate and not particularly reliable. It would not make much sense, for example, to use this task with the Logic Learning Machine algorithm in Rulex.

Consequently, if you need to explain decisions, avoid using the Principal Component Analysis task in your flow.

The Principal Component Analysis task is divided into two tabs, which will be analyzed in the sections below:


The Options tab

The Options tab is made of two areas:

  • The Available attributes area, where you can find the attributes in the source dataset.

  • The so-called configuration section. Here you can set the following parameters:
    • Use previous eigenvectors for Principal Component Analysis execution: if selected, the eigenvectors defined in the upstream PCA task will be used to create the required number of principal components.

    • Attributes for principal component analysis: drag here those attributes which you want to use in the principal component analysis. Principal Component Analysis cannot be performed on nominal values. Attributes can also be defined via a filtered list.

    • Method for distance evaluation: the method you want to use to compute distances between samples. The distance is computed as the combination of the distances for each attribute.
      Possible options are: Euclidean, Euclidean (normalized), Manhattan, Manhattan (normalized) and Pearson.

    • Normalization: the type of normalization you want to use with ordered variables.
      Possible options are: None, Attribute, Normal, Minmax [0,1] and Minmax [-1,1].

    • Minimum number of final components (0 means no minimum): the minimum number of final components the resulting dataset must contain.
      If this number of components does not reach the minimum confidence specified in the Minimum level of confidence for the resulting dataset option, further components are added until that confidence level is reached.

    • Minimum level of confidence for the resulting dataset (0 means no minimum): the minimum level of confidence the resulting dataset must have.
      If the number of components needed to reach this confidence level is lower than the value specified in the Minimum number of final components option, further components are added, and the confidence level consequently increases, until the minimum number of components is also reached.

    • Aggregate data before processing: if selected, identical patterns will be aggregated and considered as a single pattern during the principal component analysis.
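How these options could combine can be sketched as follows. This is a hypothetical illustration: the function and parameter names are ours, not Rulex's, and the "confidence" of k components is assumed to be the share of total variance they explain.

```python
import numpy as np

def pca_components(data, min_components=0, min_confidence=0.0,
                   normalization="normal", aggregate=False):
    """Hypothetical sketch of the Options tab parameters (not Rulex code)."""
    if aggregate:
        # "Aggregate data before processing": collapse identical patterns.
        data = np.unique(data, axis=0)
    if normalization == "normal":
        # z-score normalization of each ordered attribute.
        std = data.std(axis=0)
        std[std == 0] = 1.0
        data = (data - data.mean(axis=0)) / std
    centered = data - data.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Assumed definition: confidence of k components = cumulative share
    # of the total variance explained by the first k eigenvectors.
    confidence = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Honour both constraints: enough components to reach min_confidence,
    # but never fewer than min_components (and never more than exist).
    k = int(np.searchsorted(confidence, min_confidence) + 1)
    k = min(max(k, min_components, 1), len(eigenvalues))
    return centered @ eigenvectors[:, :k], float(confidence[k - 1])

rng = np.random.default_rng(1)
sample = rng.normal(size=(100, 6))
reduced, conf = pca_components(sample, min_components=2, min_confidence=0.8)
```

In this sketch, as in the task, setting either parameter to 0 disables the corresponding minimum, and the stricter of the two constraints determines the final number of components.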


The Results tab

The Results tab provides information on the computation and is divided into two sections:

  • In the General info section, you will find general information on the computation:
    • Task label

    • Elapsed time (sec)

    • Number of input attributes

    • Resulting number of attributes

    • Resulting level of confidence

    • Minimum eigenvalue

    • Maximum eigenvalue

  • In the Result quantities section, details on the analysis’ output are provided:
    • Number of samples: the number of samples is reported for each subset of data, such as Train, Test, Valid and Whole.
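For a toy dataset, the General info quantities above could be computed as follows. This is a sketch under our own assumptions about the definitions (in particular, that the resulting level of confidence is the fraction of variance explained by the kept components); it is not Rulex's reporting code.

```python
import numpy as np

# Toy run: 50 samples, 4 input attributes, 2 components kept.
rng = np.random.default_rng(2)
data = rng.normal(size=(50, 4))
centered = data - data.mean(axis=0)
# Eigenvalues of the covariance matrix, sorted in descending order.
eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(centered, rowvar=False)))[::-1]

kept = 2
info = {
    "Number of input attributes": data.shape[1],
    "Resulting number of attributes": kept,
    # Assumed definition: variance explained by the kept components.
    "Resulting level of confidence": eigenvalues[:kept].sum() / eigenvalues.sum(),
    "Minimum eigenvalue": eigenvalues.min(),
    "Maximum eigenvalue": eigenvalues.max(),
}
```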