
Get familiar with the most popular Machine Learning algorithm

Let's discover how to write Python code using the Scikit-learn Machine Learning library to solve a multiclass classification problem.

Machine Learning types

To start, let’s explain the different types of Machine Learning: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. In this article, we will focus on supervised learning.

Supervised learning

Figure: An iris

In supervised learning, the machine learns how to map an input to an output from examples of input-output pairs provided by a data scientist, who acts as a tutor.
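As a minimal sketch of this idea, the snippet below trains a Scikit-learn classifier on a handful of input-output pairs and then asks it for the output of a new input. The fruit data is hypothetical and invented purely for illustration; it is not part of the iris example used in the rest of this article.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, invented for illustration only:
# each input is [weight in grams, 1 if the skin is smooth else 0]
inputs = [[140, 1], [130, 1], [150, 0], [170, 0]]
outputs = ["apple", "apple", "orange", "orange"]

# The "tutor" step: show the model the example input-output pairs
model = DecisionTreeClassifier().fit(inputs, outputs)

# The model can now map a new, unseen input to an output
prediction = model.predict([[160, 0]])[0]
print(prediction)
```

The examples play the role of the tutor: the model is never told the rule, it only sees inputs together with the expected outputs.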

To illustrate supervised learning, I will use one of the most popular datasets for Machine Learning: the iris dataset, which you can download from the UCI Machine Learning Repository [1].

In this example, our goal is to classify an iris into one of the three existing iris classes: Iris Setosa, Iris Versicolor, and Iris Virginica.

This is known as a multiclass classification problem: we want to sort something into several different groups. When we have only two groups, we call this a binary classification. In our case, we have 3 groups corresponding to the 3 iris classes.

An iris class can be recognized based on the length and width of its sepals (the outer parts of the flower that enclose a developing bud) and of its petals. The sepal length, the sepal width, the petal length, and the petal width are the 4 features. In ML, features are individual independent variables that act as inputs to the system. Feature engineering is the process of using domain knowledge of the data to create features that make ML algorithms work properly.

The training data provides 150 instances of sepal and petal length and width, each with the corresponding outcome: the iris class.
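To get a feel for this data, here is a short sketch that inspects it. It assumes the copy of the iris dataset bundled with Scikit-learn (loaded with load_iris) rather than the CSV file from UCI, so the feature names and the numeric class labels differ slightly from the ones in the UCI file.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: use the copy of the iris dataset bundled with scikit-learn
# instead of downloading the CSV file from UCI
iris = load_iris()

features = iris.data    # one row per flower, one column per feature
labels = iris.target    # class index (0, 1 or 2) for each flower

print(iris.feature_names)    # names of the 4 features
print(features.shape)        # (150, 4): 150 instances, 4 features
print(iris.target_names)     # names of the 3 iris classes
print(np.bincount(labels))   # number of instances in each class
```

Each row of features is one input and the matching entry of labels is its expected output.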

Figure: Supervised Learning - A multiclass classification problem sample with iris classes

Scikit-learn, an open-source library to implement ML

Let’s build a machine learning implementation that is able to predict the iris class based on specific iris inputs. For this implementation, we will use the open-source library Scikit-learn.


Scikit-learn is a free software ML library for Python. It offers a range of functionality, including classification.

To use Scikit-learn, I recommend installing Anaconda, as Scikit-learn comes preinstalled with it. Anaconda is a distribution of Python for data science, ML, and analytics.

You can check that Scikit-learn (aka sklearn) is available by creating a file named iris.py with these 2 lines of code:

import sklearn as sk
print(sk.__version__)

When you run the code by opening the Anaconda Prompt and typing python iris.py, it displays the version of the sklearn library, in my case 0.23.2.

Implementation of a decision tree

To solve our multiclass classification problem, we will use a decision tree algorithm. 

The following code opens the CSV file from the UCI web site, builds the dataset, separates the features and the labels, creates a decision tree, and populates the decision tree based on the features and the labels. Then, it displays the decision tree and finally makes a prediction for an unknown iris.

# Load libraries
import matplotlib.pyplot as plt
from pandas import read_csv
from sklearn import tree

# Load dataset from UCI
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Convert the dataset to a NumPy array
array = dataset.values

# Select which part of the matrix is input and which is output/class
features = array[:, 0:4]
labels = array[:, 4]

# Create an untrained tree classifier object
clf = tree.DecisionTreeClassifier()

# Train the classifier with the features and the labels
clf = clf.fit(features, labels)

# Display the tree (when run as a script, close the figure window to continue)
plt.figure(figsize=(15, 10))
tree.plot_tree(clf)
plt.show()

# Predict class for 1 unknown iris
irisClass = clf.predict([[5.2, 3.2, 4.2, 1.2]])
print(irisClass)

This is what the decision tree looks like:

Iris Classification Decision Tree

You can easily follow how the machine uses the decision tree to make its decision:

  1. It starts with X[3], the fourth feature (as X[0] is the first one), namely the petal width. Is X[3] <= 0.8? No, so we go to the right of the tree, as our iris has a petal width of 1.2.
  2. Is X[3] <= 1.75? Yes, we go to the left.
  3. Is X[2] <= 4.95? Yes, we go to the left, as our iris has a petal length of 4.2.
  4. Is X[3] <= 1.65? Yes, we go to the left and reach a leaf: our iris belongs to the 2nd class, which is Iris Versicolor. This is also the answer that our little Python program gives.
I like decision trees as they are easy to follow for a human being.  
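The same walk through the tree can be sketched in code. The snippet below is an illustration under a few assumptions: it uses the iris dataset bundled with Scikit-learn instead of the UCI file, and it fixes random_state so that the tree is reproducible. Because several splits can be equally good, the nodes and thresholds it prints may differ from the ones in the picture.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: the copy of the iris dataset bundled with scikit-learn
iris = load_iris()

# Assumption: random_state is fixed only to make the tree reproducible
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

sample = [5.2, 3.2, 4.2, 1.2]

# decision_path returns a sparse matrix marking the nodes the sample visits;
# for a single sample, .indices lists those node ids from root to leaf
node_ids = clf.decision_path([sample]).indices

for node_id in node_ids:
    if clf.tree_.children_left[node_id] == -1:  # -1 marks a leaf node
        print(f"node {node_id}: leaf reached")
        continue
    feature = clf.tree_.feature[node_id]
    threshold = clf.tree_.threshold[node_id]
    direction = "left" if sample[feature] <= threshold else "right"
    print(f"node {node_id}: X[{feature}] = {sample[feature]} <= {threshold:.2f}? go {direction}")

# Translate the predicted class index into a class name
predicted = iris.target_names[clf.predict([sample])[0]]
print(predicted)
```

Each printed line corresponds to one question asked on the way down the tree, and the last line is the class of the leaf that was reached.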

The Gini index

The Gini index shown in the picture measures how mixed the classes are in a node. It is computed as 1 minus the sum of the squared class proportions. A value of 0 expresses the purity of classification, i.e. all the elements in the node belong to a single class. The more evenly the elements are spread across classes, the higher the value: it reaches 0.5 for two equally represented classes and about 0.667 for three, and it approaches 1 only as the number of equally represented classes grows. By default, the tree keeps splitting until every leaf is pure, which is why all the leaves of the tree have a Gini index of 0.
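As a small worked example, the Gini index of a node can be computed from the number of elements of each class it contains, using the usual definition: 1 minus the sum of the squared class proportions. The helper name gini_impurity is hypothetical and chosen for this sketch; it is not a Scikit-learn function.

```python
def gini_impurity(class_counts):
    """Return 1 minus the sum of squared class proportions in a node."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# A node holding 50 instances of each of the 3 iris classes: about 0.667
print(gini_impurity([50, 50, 50]))

# A pure node, such as a leaf containing only one class: 0.0
print(gini_impurity([50, 0, 0]))

# Two equally represented classes: 0.5
print(gini_impurity([50, 50]))
```

With 3 classes, the value can therefore never exceed about 0.667, which is reached when the three classes are equally represented.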

A brief summary

This short example covers the following concepts and tricks:

  • There are multiple ML types and supervised learning is one of them
  • Supervised learning is well suited to solving binary or multiclass classification problems
  • We know the iris dataset better, which is a popular dataset to start with when learning ML
  • We have learned what feature engineering is and we have worked with 4 features in the iris example
  • Our training data contains the features and also the labels
  • In this example, we have used a decision tree algorithm
  • We have used the Scikit-learn open-source ML library to build our decision tree based on our data with only a couple of lines of code
  • It is possible to display the tree and to understand how the machine makes the prediction for an unknown iris
  • We are now familiar with the Gini index

Sources

[1] Iris Data Set, UCI Machine Learning Repository, University of California, Irvine

[2] Hello World – Machine Learning Recipes #1, a video from Josh Gordon, Staff Developer Advocate ML Frameworks,  Google

[3] Your First Machine Learning Project in Python Step-By-Step by Jason Brownlee on February 10, 2019 

Photo credits

Robot Photo by Rock’n Roll Monkey on Unsplash

Iris Photo by Olesya Blinskaya on Unsplash
