Wine classification with SLAVE, a genetic fuzzy system
In this tutorial we will train and test a genetic fuzzy system for classifying the Wine Data Set from the UCI Machine Learning Repository.
For this tutorial we are going to use SLAVE: a genetic learning system based on an iterative approach. The training process has some parameters that can be tuned, such as the population size or the mutation probability.
The performance of the model will be determined by averaging the accuracy measures obtained in a 5-fold cross-validation. This score will then be used to automatically fine-tune some parameters of the model.
1. Create a new project
In the projects screen, press the Create new project button.
Write a name for the project.
And press Start. The work screen will appear.
2. Upload the data
Download the file wine.data from here.
Now we need to upload the file to the project. Press the Create card button.
Press Upload file and select the file you just downloaded. A new entry, wine.data, will appear in the menu.
Press the wine.data entry to create an Open file card.
3. Prepare the data
According to the dataset description, wine.data is a comma-separated values file with the following columns:
- Class (1, 2 or 3)
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
Our model will take as input columns 2 to 14 and will give as output a prediction of the first column.
First of all we need to convert the file stream into a 2D tensor (also known as a table).
Press the Create card button.
In the Modules tab, navigate to Files and formats and press Read as CSV.
A new Read as CSV card will appear in the blueprint.
Connect the Stream output from Open file to the Stream input of Read as CSV.
Now we just have to configure the parameters of the CSV reader: the delimiter and the header rows. Getting a glimpse of the file contents can help us with this task.
Select the Open file card by pressing on it and press the Preview output button on the top bar:
A dialog will appear showing a preview of the contents of wine.data.
As we can see, this file has no headers and the values are delimited by commas (,). This matches the current configuration of our CSV reader, so it is ready to process the file.
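Conceptually, the Read as CSV card turns the raw character stream into a table of numbers. A minimal Python sketch of that step (the sample rows below are hypothetical, trimmed to 3 of the 13 attributes):

```python
import csv
import io

# Hypothetical sample mimicking the first rows of wine.data:
# the class label comes first, followed by numeric attributes,
# with no header row.
sample = "1,14.23,1.71,2.43\n1,13.2,1.78,2.14\n2,12.37,0.94,1.36\n"

# No header and comma delimiter, matching the reader configuration.
reader = csv.reader(io.StringIO(sample), delimiter=",")
table = [[float(value) for value in row] for row in reader]
```

With the real file, the stream would come from disk instead of an in-memory string, but the parsing logic is the same.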
Select the Read as CSV card and press the Preview output button on the top bar. The preview dialog will show a table this time.
If we scroll down we will see that the dataset is sorted by the first column (the class). Performing a k-fold cross-validation in this state would lead to very poor results, since in most folds the training set would not contain the same classes as the test set.
We need to shuffle the dataset. Press the Create card button, navigate to Tables and press Shuffle rows. A new Shuffle rows card will appear in the blueprint.
Connect the Table output from Read as CSV to the Data input of Shuffle rows.
If we get a preview of the Shuffled output, we can see that the table has been correctly shuffled.
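What the Shuffle rows card does can be sketched in a few lines of Python. The table below is a hypothetical stand-in, sorted by class like wine.data; shuffling reorders the rows without adding or losing any:

```python
import random

# Hypothetical table sorted by class, as wine.data is:
# each row is [class, attribute].
rows = [[cls, cls * 10.0] for cls in (1, 1, 1, 2, 2, 2, 3, 3, 3)]

rng = random.Random(42)   # fixed seed so the shuffle is reproducible
shuffled = list(rows)     # copy: the original table is left untouched
rng.shuffle(shuffled)
```

After shuffling, each contiguous slice of the table contains a mix of classes, which is what makes the later k-fold split meaningful.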
We are finally ready to perform the K-fold cross-validation.
Press the Create card button, navigate to Validation and press K-fold cross-validation. A new K-fold cross-validation card will appear in the blueprint.
Connect the Shuffled output from Shuffle rows to the Data input of K-fold cross-validation.
As we can see, the K-fold cross-validation card already has a default value of 5 for K, which is exactly what we need.
This card also triggers two different events: On each fold, which provides the training and testing tensors, and On finish, which has no associated data.
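The splitting behind the On each fold event can be sketched as follows (a simplified version that assumes the row count is divisible by K, as with a toy table of 10 rows):

```python
def k_fold(rows, k):
    """Yield (training, testing) row lists, one pair per fold."""
    fold_size = len(rows) // k
    for i in range(k):
        testing = rows[i * fold_size:(i + 1) * fold_size]
        training = rows[:i * fold_size] + rows[(i + 1) * fold_size:]
        yield training, testing

rows = list(range(10))          # stand-in for the shuffled table
folds = list(k_fold(rows, k=5))
```

Each row appears in the testing set of exactly one fold and in the training set of the other four.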
At this point our data is ready. We can start working on the training part of the pipeline.
4. Training the model
First of all we need to split the input and output of the training table. Press the Create card button, navigate to Tables and press Split into X and Y by columns. A new Split into X and Y by columns card will appear in the blueprint.
Connect the Training output from K-fold cross-validation to the Data input of Split into X and Y by columns.
We want to predict the value of the 1st column of the table (index 0) using columns 2 to 14 (indexes 1 to 13) as input for our model. Set X column(s) to “1:13” and Y column(s) to “0” in Split into X and Y by columns.
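Splitting a table into X and Y by column indexes amounts to simple slicing. A minimal sketch, using a hypothetical 3-column table (class label plus two attributes):

```python
def split_xy(table, x_cols, y_col):
    """Split a table into model inputs X and expected outputs Y."""
    X = [[row[c] for c in x_cols] for row in table]
    Y = [row[y_col] for row in table]
    return X, Y

# Hypothetical 3-column table: class label plus two attributes.
table = [[1, 14.23, 1.71],
         [2, 12.37, 0.94]]
X, Y = split_xy(table, x_cols=range(1, 3), y_col=0)
```

For the real dataset the call would use `x_cols=range(1, 14)`, matching the “1:13” range set on the card.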
We are ready to train our model. Press the Create card button, navigate to Models ≫ Fuzzy logic and press Train SLAVE classifier. A new Train SLAVE classifier card will appear in the blueprint.
Connect the X and Y outputs from Split into X and Y by columns to the Training X and Training Y inputs of Train SLAVE classifier respectively.
As we can see, the training process of SLAVE depends on several parameters that can be tuned (e.g., number of labels, population size, mutation probability). We do not know which combination of values leads to the best accuracy, or what effect each parameter has on the overall performance, but that is not a problem: Protopipe has a way of answering these kinds of questions.
In this tutorial we will analyze the effect of the population size and the mutation probability on the training process.
Press the Create card button, navigate to Parameters and press Integer parameter. A new dialog will appear asking for the name of this parameter.
Write “Population size” and press Set. A new Integer parameter card will appear in the blueprint.
Connect the Value output from Integer parameter to the Population size input of Train SLAVE classifier.
We must specify a domain of possible values for this parameter. In this tutorial we will try 10, 50 and 100.
Now we will do the same for mutation probability. Press the Create card button, navigate to Parameters and press Float parameter. Name this parameter “Mutation probability”.
Connect the Value output from Float parameter to the Mutation probability input of Train SLAVE classifier.
The domain of possible values for this parameter will be between 0.1 and 0.9.
This finishes the training part of the pipeline. We are ready to measure the performance of the model.
5. Testing the model
First of all we need to split the input and output of the testing table, as we previously did with the training table. Press the Create card button, navigate to Tables and press Split into X and Y by columns. A new Split into X and Y by columns card will appear in the blueprint.
Connect the Testing output from K-fold cross-validation to the Data input of Split into X and Y by columns.
As we did before, set X column(s) to “1:13” and Y column(s) to “0” in Split into X and Y by columns.
At this point we are ready to ask the model for a prediction. Press the Create card button, navigate to Models and press Predict. A new Predict card will appear in the blueprint.
Connect the Model output from Train SLAVE classifier to the Model input of Predict, then connect the X output from Split into X and Y by columns to the X input of Predict.
The performance of the model will be determined by averaging the accuracy obtained in each fold. Press the Create card button, navigate to Loss functions and press Accuracy. A new Accuracy card will appear in the blueprint.
Connect the Y’ output from Predict to the Predictions input of Accuracy, then connect the Y output from Split into X and Y by columns to the Target input of Accuracy.
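Accuracy is simply the fraction of predictions that match their target class. A minimal sketch, with a hypothetical fold in which 3 of 4 predictions are correct:

```python
def accuracy(predictions, targets):
    """Fraction of predictions that match their target class."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Hypothetical fold: 3 of the 4 predicted classes are correct.
acc = accuracy(predictions=[1, 2, 2, 3], targets=[1, 2, 3, 3])
```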
Our goal is to calculate the mean accuracy over all folds, so we need to store each accuracy value somewhere. On each fold we will load a list-of-float variable, add the new accuracy value, and store it again.
Press the Create card button, navigate to Variables ≫ Getters and press Get list of float. A new Get list of float card will appear in the blueprint.
Set the Name input as “accuracies”.
Make Get list of float an explicit listener of the On each fold event of K-fold cross-validation by dragging & dropping the square socket next to Call into the Get list of float card.
We need to load and store the variable on each fold in order to always access its latest value; otherwise the system would load it only once and it would always be empty.
Press the Create card button, navigate to Lists and press Add float to list. A new Add float to list card will appear in the blueprint.
Connect the Value output from Get list of float to the List input of Add float to list, then connect the Accuracy output from Accuracy to the Value input of Add float to list.
Press the Create card button, navigate to Variables ≫ Setters and press Set list of float. A new Set list of float card will appear in the blueprint.
Set the Name input as “accuracies”.
Connect the Result output from Add float to list to the Value input of Set list of float.
At this point our pipeline stores the accuracy obtained by each trained model on each fold. We are ready to compute their mean value.
Press the Create card button, navigate to Variables ≫ Getters and press Get list of float. A new Get list of float card will appear in the blueprint.
Set the Name input as “accuracies”.
Make this Get list of float card an explicit listener of the On finish event of K-fold cross-validation by dragging & dropping the square socket next to Call into the Get list of float card.
Press the Create card button, navigate to Statistics and press Mean. A new Mean card will appear in the blueprint.
Connect the Value output of Get list of float to the Values input of Mean.
We are almost done. Now we just need to return the obtained value, so Protopipe can generate the final report. Press the Create card button, navigate to Returns and press Return float. A new dialog will appear asking for the name of this return value.
Write “Mean accuracy” and press Set. A new Return float card will appear in the blueprint.
Connect the Mean output from Mean to the Value input of Return float.
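The get/add/set loop over the “accuracies” variable, followed by the Mean card, can be sketched like this (the per-fold scores below are hypothetical):

```python
accuracies = []   # the "accuracies" list-of-float variable, empty at start

for fold_accuracy in (0.9, 0.8, 1.0, 0.85, 0.95):  # hypothetical fold scores
    # Get list of float -> Add float to list -> Set list of float
    accuracies = accuracies + [fold_accuracy]

# The Mean card, fired by the On finish event:
mean_accuracy = sum(accuracies) / len(accuracies)
```

The key point mirrored from the pipeline is that the list is re-read and re-stored on every fold, so each iteration sees the values accumulated so far.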
That’s it! Our pipeline is ready!
6. Experimenting
Press the Fine tune settings button on the top bar.
A new panel will appear at the right side of the screen.
In Assignation of the values choose “Brute-force” and then set “3 samples” for the Mutation probability float parameter.
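With brute-force assignation, Protopipe runs one experiment per combination of parameter values. A sketch of the resulting grid, taking 3 evenly spaced samples from the [0.1, 0.9] mutation-probability domain and the three population sizes chosen earlier:

```python
import itertools

population_sizes = [10, 50, 100]   # integer parameter domain

# 3 evenly spaced samples over the [0.1, 0.9] float domain
mutation_probs = [0.1 + i * (0.9 - 0.1) / 2 for i in range(3)]

experiments = list(itertools.product(population_sizes, mutation_probs))
```

This yields 3 × 3 = 9 experiments, each of which runs the full 5-fold cross-validation pipeline.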
Press the Start processing button to run all the experiments.
A new panel will appear at the right side of the screen, showing real-time information about the state of the processing.
When the processing successfully finishes a new dialog will appear.
Press See report to open the Reports screen, which contains a table summarizing all the experiments performed.
Sort the table by “Mean accuracy” in descending order to see which model performed best.
7. Analysis
On the left side panel, under the most recent report node, click on Cross-sectional analysis.
In this screen you can compare the effect of a parameter (X axis) on a return value (Y axis).
For example, here we can see how the Population size affects Mean accuracy:
8. Conclusion
In this tutorial we designed a pipeline for training and testing a classifier and analyzed the obtained results after performing multiple experiments with different combinations of parameter values.
This tutorial can be extended by fine tuning other parameters (e.g., number of labels, use rule weights) or generating reports with the accuracy obtained for each class separately.