Machine Learning types
To start, let’s explain the different classes of Machine Learning: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. In this article, we will focus on supervised learning.
In supervised learning, the machine learns how to map an input to an output based on examples of input-output pairs provided by a data scientist who acts as a tutor.
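As a minimal sketch (with illustrative values, not taken from any real training run), the "example input-output pairs" are simply measurements paired with the correct answer:

```python
# Minimal illustration of supervised-learning data (illustrative values):
# each input is a list of measurements, each output is the label the
# machine should learn to predict.
inputs  = [[5.1, 3.5, 1.4, 0.2],
           [6.4, 3.2, 4.5, 1.5]]
outputs = ['Iris-setosa', 'Iris-versicolour']

# The training set is the collection of (input, output) pairs.
pairs = list(zip(inputs, outputs))
print(pairs[0])  # ([5.1, 3.5, 1.4, 0.2], 'Iris-setosa')
```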
To illustrate supervised learning, I will use one of the most popular datasets in Machine Learning: the iris dataset, which you can download from UCI.
In this example, our goal is to classify an iris amongst the three existing iris classes: Iris Setosa, Iris Versicolour, and Iris Virginica.
This is known as a multiclass classification problem: we want to sort something into several different groups. When we have only two groups, we call this a binary classification. In our case, we have 3 groups corresponding to the 3 iris classes.
An iris class can be recognized from the flower's sepals (the outer parts of the flower that enclose a developing bud) and petals. The sepal length, the sepal width, the petal length, and the petal width are the 4 features. In ML, features are individual independent variables that act as inputs to the system. Feature engineering is the process of using domain knowledge of the data to create features that make ML algorithms work properly.
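As an illustration of feature engineering (a hypothetical extension — the iris dataset itself only ships the four raw measurements), domain knowledge about flowers might suggest a derived feature such as petal area:

```python
# Hypothetical feature-engineering sketch: derive a new feature (petal area)
# from the raw petal length and width. Sample values in cm, for illustration.
petal_length = [1.4, 4.2, 5.9]
petal_width  = [0.2, 1.2, 2.1]

# Approximate each petal's area as length * width
petal_area = [round(l * w, 2) for l, w in zip(petal_length, petal_width)]
print(petal_area)  # [0.28, 5.04, 12.39]
```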
The training data gives 150 instances of sepal and petal length and width, together with the corresponding outcome: the iris class.
Scikit-learn, an open-source library to implement ML
Scikit-learn is a free software ML library for Python. It provides various functionalities, including classification.
To use Scikit-learn, I recommend installing Anaconda, as Scikit-learn comes preinstalled with it. Anaconda is a distribution of Python for data science, ML, and analytics.
You can check that Scikit-learn (aka sklearn) is available by creating a file named iris.py with these 2 lines of code:
import sklearn as sk
print(sk.__version__)
When you execute your code from the Anaconda prompt by typing python iris.py, it displays the version of the sklearn library, in my case 0.23.2.
Implementation of a decision tree
To solve our multiclass classification problem, we will use a decision tree algorithm.
The following code opens the CSV file from the UCI website, builds the dataset, separates the features and the labels, creates a decision tree, and populates the decision tree based on the features and the labels. Then, it displays the decision tree and finally makes a prediction for an unknown iris.
# Load libraries
from pandas import read_csv
from sklearn import tree
from matplotlib.pylab import rcParams

# Load dataset from UCI
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
# Select which part of the matrix is input and which is output/class
features = array[:, 0:4]
labels = array[:, 4]

# Creates an empty tree classifier object
clf = tree.DecisionTreeClassifier()
# Populates the classifier with the features and the labels
clf = clf.fit(features, labels)

# Display the tree
rcParams['figure.figsize'] = 15, 10
tree.plot_tree(clf)

# Predict class for 1 unknown iris
irisClass = clf.predict([[5.2, 3.2, 4.2, 1.2]])
print(irisClass)
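The code above trains on all 150 instances. As an optional extension (a sketch, assuming scikit-learn's usual train_test_split and accuracy_score helpers, and using the copy of the iris data bundled with scikit-learn so the snippet runs on its own), one could hold out part of the data to estimate how well the tree generalizes:

```python
# Optional sketch: estimate the tree's accuracy on held-out data.
# load_iris provides the same 150x4 iris data, bundled with scikit-learn,
# so we do not need the CSV download here.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import tree

features, labels = load_iris(return_X_y=True)

# Keep 20% of the instances aside for testing
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=1)

clf = tree.DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)

# Accuracy on irises the tree has never seen
print(accuracy_score(y_test, clf.predict(X_test)))
```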
This is what the decision tree looks like:
You can easily follow how the machine is using the decision tree to make its decision:
- It starts with X[3], the fourth feature (X[0] being the first one), namely the petal width. In our case, the condition X[3] <= 0.8 is false, so we go to the right of the tree, as our iris has a petal width of 1.2.
- Is X[3] <= 1.75? Yes, so we go to the left.
- Is X[2] <= 4.95? Yes, so we go to the left, as our iris has a petal length of 4.2.
- Is X[3] <= 1.65? Yes, so we go to the left and we reach a leaf of the 2nd class, which is an Iris Versicolour. This is indeed the answer that our little Python program gives.
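This walk through the tree can also be asked of scikit-learn directly: a trained tree's decision_path method reports which nodes a sample visits. A sketch (retraining on the iris data bundled with scikit-learn so the snippet is self-contained; the exact node numbers and thresholds may differ slightly from the picture):

```python
# Sketch: let scikit-learn report the path a sample takes through the tree.
from sklearn.datasets import load_iris
from sklearn import tree

features, labels = load_iris(return_X_y=True)   # classes encoded as 0, 1, 2
clf = tree.DecisionTreeClassifier(random_state=0).fit(features, labels)

sample = [[5.2, 3.2, 4.2, 1.2]]                 # the unknown iris from the article
node_indicator = clf.decision_path(sample)      # sparse matrix: nodes visited
node_ids = node_indicator.indices               # ids of the visited nodes

for node in node_ids:
    if clf.tree_.children_left[node] == -1:     # -1 marks a leaf node
        print(f"node {node}: leaf")
    else:
        feat = clf.tree_.feature[node]
        thresh = clf.tree_.threshold[node]
        print(f"node {node}: is X[{feat}] <= {thresh:.2f} ?")

print(clf.predict(sample))                      # 1 == the 2nd class, Versicolour
```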
The Gini index
The Gini index shown in the picture varies between 0 and 1. A value of 0 expresses the purity of a node: all the elements there belong to a single class. Values approaching 1 indicate that the elements are randomly distributed across many classes, and a value of 0.5 shows an equal distribution of elements over two classes. That is why all the leaves of the tree have a Gini index of 0: each leaf contains irises of only one class.
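Concretely, the Gini index of a node is 1 minus the sum of the squared class proportions. A quick sketch of the formula on a few example class distributions:

```python
# Sketch: Gini index = 1 - sum(p_i ** 2) over the class proportions p_i.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([50, 0, 0]))    # pure node: 0.0
print(gini([25, 25]))      # equal split over two classes: 0.5
print(gini([50, 50, 50]))  # three equal classes (the iris root): ~0.667
```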
A brief summary
This short example lets us learn the following concepts and tricks:
- There are multiple ML types, and supervised learning is one of them
- Supervised learning is well suited to solving binary or multiclass classification problems
- We got to know the iris dataset better, a popular dataset to start with when learning ML
- We have learned what feature engineering is, and the iris example relies on 4 features
- Our training data contains the features and also the labels
- In this example, we have used a decision tree algorithm
- We have used the Scikit-learn open-source ML library to build our decision tree from our data with only a couple of lines of code
- It is possible to display the tree and to understand how the machine makes the prediction for an unknown iris
- We are now familiar with the Gini index
- Iris Data Set, UCI Machine Learning Repository, University of California, Irvine
- Hello World – Machine Learning Recipes #1, a video by Josh Gordon, Staff Developer Advocate ML Frameworks, Google
- Your First Machine Learning Project in Python Step-By-Step, by Jason Brownlee, February 10, 2019