Wine classification with SLAVE, a genetic fuzzy system
In this tutorial we will train and test a genetic fuzzy system for classifying the Wine Data Set from the UCI Machine Learning Repository.
For this tutorial we are going to use SLAVE: a genetic learning system based on an iterative approach. The training process has some parameters that can be tuned, such as the population size or the mutation probability.
The performance of the model will be determined by averaging the accuracy measures obtained in a 5-fold cross-validation. This score will then be used to automatically fine-tune some parameters of the model.
1. Create a new project
In the projects screen, press the Create new project button.
Write a name for the project.
And press Start. The work screen will appear.
2. Upload the data
Download the file wine.data from here.
Now we need to upload the file to the project. Press the Create card button.
Press Upload file and select the file you just downloaded. A new entry, wine.data, will appear in the menu.
Press the wine.data entry to create an Open file card.
3. Prepare the data
According to the dataset description, wine.data is a comma-separated values file with the following columns:
- Class (1, 2 or 3)
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
Our model will take as input columns 2 to 14 and will give as output a prediction of the first column.
First of all we need to convert the file stream into a 2D tensor (also known as a table).
Press the Create card button.
In the Modules tab, navigate to Files and formats and press Read as CSV.
A new Read as CSV card will appear in the blueprint.
Connect the Stream output from Open file to the Stream input of Read as CSV.
Now we just have to configure the parameters of the CSV reader: the delimiter and the header rows. Getting a glimpse of the file contents can help us with this task.
Select the Open file card by pressing on it and press the Preview output button on the top bar:
A dialog will appear showing a preview of the contents of wine.data.
As we can see, this file has no headers and the values are delimited by commas (,). This matches the current configuration of our CSV reader, so it is ready to process the file.
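Conceptually, the Read as CSV card turns the raw character stream into a table of numbers. A minimal Python sketch of that step (the sample rows below are hypothetical, trimmed to 3 of the 13 attributes):

```python
import csv
import io

# Hypothetical sample mimicking the first rows of wine.data:
# the class label comes first, followed by numeric attributes,
# with no header row.
sample = "1,14.23,1.71,2.43\n1,13.2,1.78,2.14\n2,12.37,0.94,1.36\n"

# No header and comma delimiter, matching the reader configuration.
reader = csv.reader(io.StringIO(sample), delimiter=",")
table = [[float(value) for value in row] for row in reader]
```

With the real file, the stream would come from disk instead of an in-memory string, but the parsing logic is the same.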
Select the Read as CSV card and press the Preview output button on the top bar. The preview dialog will show a table this time.
If we scroll down we will see that the dataset is sorted by the first column (the class). Performing a k-fold cross-validation in this state would lead to very poor results, since in most folds the training set would not contain the same classes as the test set.
We need to shuffle the dataset. Press the Create card button, navigate to Tables and press Shuffle rows. A new Shuffle rows card will appear in the blueprint.
Connect the Table output from Read as CSV to the Data input of Shuffle rows.
If we get a preview of the Shuffled output, we can see that the table has been correctly shuffled.
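What the Shuffle rows card does can be sketched in a few lines of Python. The table below is a hypothetical stand-in, sorted by class like wine.data; shuffling reorders the rows without adding or losing any:

```python
import random

# Hypothetical table sorted by class, as wine.data is:
# each row is [class, attribute].
rows = [[cls, cls * 10.0] for cls in (1, 1, 1, 2, 2, 2, 3, 3, 3)]

rng = random.Random(42)   # fixed seed so the shuffle is reproducible
shuffled = list(rows)     # copy: the original table is left untouched
rng.shuffle(shuffled)
```

After shuffling, each contiguous slice of the table contains a mix of classes, which is what makes the later k-fold split meaningful.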
We are finally ready to perform the K-fold cross-validation.
Press the Create card button, navigate to Validation and press K-fold cross-validation. A new K-fold cross-validation card will appear in the blueprint.
Connect the Shuffled output from Shuffle rows to the Data input of K-fold cross-validation.
As we can see, the K-fold cross-validation card already has a default value of 5 for K, which is exactly what we need.
This card also triggers two different events: On each fold, which provides the training and testing tensors, and On finish, which has no associated data.
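The splitting behind the On each fold event can be sketched as follows (a simplified version that assumes the row count is divisible by K, as with a toy table of 10 rows):

```python
def k_fold(rows, k):
    """Yield (training, testing) row lists, one pair per fold."""
    fold_size = len(rows) // k
    for i in range(k):
        testing = rows[i * fold_size:(i + 1) * fold_size]
        training = rows[:i * fold_size] + rows[(i + 1) * fold_size:]
        yield training, testing

rows = list(range(10))          # stand-in for the shuffled table
folds = list(k_fold(rows, k=5))
```

Each row appears in the testing set of exactly one fold and in the training set of the other four.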
At this point our data is ready. We can start working on the training part of the pipeline.
4. Training the model
First of all we need to split the input and output of the training table. Press the Create card button, navigate to Tables and press Split into X and Y by columns. A new Split into X and Y by columns card will appear in the blueprint.
Connect the Training output from K-fold cross-validation to the Data input of Split into X and Y by columns.
We want to predict the value of the 1st column of the table (index 0) using columns 2 to 14 (indexes 1 to 13) as input for our model. Set X column(s) to “1:13” and Y column(s) to “0” in Split into X and Y by columns.
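Splitting a table into X and Y by column indexes amounts to simple slicing. A minimal sketch, using a hypothetical 3-column table (class label plus two attributes):

```python
def split_xy(table, x_cols, y_col):
    """Split a table into model inputs X and expected outputs Y."""
    X = [[row[c] for c in x_cols] for row in table]
    Y = [row[y_col] for row in table]
    return X, Y

# Hypothetical 3-column table: class label plus two attributes.
table = [[1, 14.23, 1.71],
         [2, 12.37, 0.94]]
X, Y = split_xy(table, x_cols=range(1, 3), y_col=0)
```

For the real dataset the call would use `x_cols=range(1, 14)`, matching the “1:13” range set on the card.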
We are ready to train our model. Press the Create card button, navigate to Models ≫ Fuzzy logic and press Train SLAVE classifier. A new Train SLAVE classifier card will appear in the blueprint.
Connect the X and Y outputs from Split into X and Y by columns to the Training X and Training Y inputs of Train SLAVE classifier respectively.
As we can see, the training process of SLAVE depends on several parameters that can be tuned (e.g., number of labels, population size, mutation probability). We do not know which combination of values leads to the best accuracy, or what effect each parameter has on the overall performance, but that is not a problem: Protopipe has a way of answering these kinds of questions.
In this tutorial we will analyze the effect of the population size and the mutation probability on the training process.
Press the Create card button, navigate to Parameters and press Integer parameter. A new dialog will appear asking for the name of this parameter.
Write “Population size” and press Set. A new Integer parameter card will appear in the blueprint.
Connect the Value output from Integer parameter to the Population size input of Train SLAVE classifier.
We must specify a domain of possible values for this parameter. In this tutorial we will try 10, 50 and 100.
Now we will do the same for mutation probability. Press the Create card button, navigate to Parameters and press Float parameter. Name this parameter “Mutation probability”.
Connect the Value output from Float parameter to the Mutation probability input of Train SLAVE classifier.
The domain of possible values for this parameter will be between 0.1 and 0.9.
This finishes the training part of the pipeline. We are ready to measure the performance of the model.
5. Testing the model
First of all we need to split the input and output of the testing table, as we previously did with the training table. Press the Create card button, navigate to Tables and press Split into X and Y by columns. A new Split into X and Y by columns card will appear in the blueprint.
Connect the Testing output from K-fold cross-validation to the Data input of Split into X and Y by columns.
As we did before, set X column(s) to “1:13” and Y column(s) to “0” in Split into X and Y by columns.
At this point we are ready to ask the model for a prediction. Press the Create card button, navigate to Models and press Predict. A new Predict card will appear in the blueprint.
Connect the Model output from Train SLAVE classifier to the Model input of Predict, then connect the X output from Split into X and Y by columns to the X input of Predict.
The performance of the model will be determined by averaging the accuracy obtained in each fold. Press the Create card button, navigate to Loss functions and press Accuracy. A new Accuracy card will appear in the blueprint.
Connect the Y’ output from Predict to the Predictions input of Accuracy, then connect the Y output from Split into X and Y by columns to the Target input of Accuracy.
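Accuracy is simply the fraction of predictions that match their target class. A minimal sketch, with a hypothetical fold in which 3 of 4 predictions are correct:

```python
def accuracy(predictions, targets):
    """Fraction of predictions that match their target class."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Hypothetical fold: 3 of the 4 predicted classes are correct.
acc = accuracy(predictions=[1, 2, 2, 3], targets=[1, 2, 3, 3])
```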
Our goal is to calculate the mean accuracy over all folds, so we need to store each accuracy value somewhere. On each fold we will load a list-of-float variable, add the new accuracy value, and store it again.
Press the Create card button, navigate to Variables ≫ Getters and press Get list of float. A new Get list of float card will appear in the blueprint.
Set the Name input as “accuracies”.
Make Get list of float an explicit listener of the On each fold event of K-fold cross-validation by dragging & dropping the square socket next to Call into the Get list of float card.
We need to load and store the variable on each fold in order to always access its latest value; otherwise the system would load it only once and it would always be empty.
Press the Create card button, navigate to Lists and press Add float to list. A new Add float to list card will appear in the blueprint.
Connect the Value output from Get list of float to the List input of Add float to list, then connect the Accuracy output from Accuracy to the Value input of Add float to list.
Press the Create card button, navigate to Variables ≫ Setters and press Set list of float. A new Set list of float card will appear in the blueprint.
Set the Name input as “accuracies”.
Connect the Result output from Add float to list to the Value input of Set list of float.
At this point our pipeline stores the accuracy obtained by each trained model on each fold. We are ready to compute their mean value.
Press the Create card button, navigate to Variables ≫ Getters and press Get list of float. A new Get list of float card will appear in the blueprint.
Set the Name input as “accuracies”.
Make this Get list of float card an explicit listener of the On finish event of K-fold cross-validation by dragging & dropping the square socket next to Call into the Get list of float card.
Press the Create card button, navigate to Statistics and press Mean. A new Mean card will appear in the blueprint.
Connect the Value output of Get list of float to the Values input of Mean.
We are almost done. Now we just need to return the obtained value, so Protopipe can generate the final report. Press the Create card button, navigate to Returns and press Return float. A new dialog will appear asking for the name of this return value.
Write “Mean accuracy” and press Set. A new Return float card will appear in the blueprint.
Connect the Mean output from Mean to the Value input of Return float.
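The get/add/set loop over the “accuracies” variable, followed by the Mean card, can be sketched like this (the per-fold scores below are hypothetical):

```python
accuracies = []   # the "accuracies" list-of-float variable, empty at start

for fold_accuracy in (0.9, 0.8, 1.0, 0.85, 0.95):  # hypothetical fold scores
    # Get list of float -> Add float to list -> Set list of float
    accuracies = accuracies + [fold_accuracy]

# The Mean card, fired by the On finish event:
mean_accuracy = sum(accuracies) / len(accuracies)
```

The key point mirrored from the pipeline is that the list is re-read and re-stored on every fold, so each iteration sees the values accumulated so far.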
That’s it! Our pipeline is ready!
6. Experimenting
Press the Fine tune settings button on the top bar.
A new panel will appear at the right side of the screen.
In Assignation of the values choose “Brute-force” and then set “3 samples” for the Mutation probability float parameter.
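With brute-force assignation, Protopipe runs one experiment per combination of parameter values. A sketch of the resulting grid, taking 3 evenly spaced samples from the [0.1, 0.9] mutation-probability domain and the three population sizes chosen earlier:

```python
import itertools

population_sizes = [10, 50, 100]   # integer parameter domain

# 3 evenly spaced samples over the [0.1, 0.9] float domain
mutation_probs = [0.1 + i * (0.9 - 0.1) / 2 for i in range(3)]

experiments = list(itertools.product(population_sizes, mutation_probs))
```

This yields 3 × 3 = 9 experiments, each of which runs the full 5-fold cross-validation pipeline.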
Press the Start processing button to run all the experiments.
A new panel will appear at the right side of the screen, showing real-time information about the state of the processing.
When the processing successfully finishes a new dialog will appear.
Press See report to open the Reports screen, which contains a table summarizing all the experiments performed.
Sort the table by “Mean accuracy” in descending order to see which model performed best.
7. Analysis
On the left side panel, under the most recent report node, click on Cross-sectional analysis.
In this screen you can compare the effect of a parameter (X axis) on a return value (Y axis).
For example, here we can see how the Population size affects Mean accuracy:
8. Conclusion
In this tutorial we designed a pipeline for training and testing a classifier and analyzed the obtained results after performing multiple experiments with different combinations of parameter values.
This tutorial can be extended by fine tuning other parameters (e.g., number of labels, use rule weights) or generating reports with the accuracy obtained for each class separately.