Introduction to machine learning
Today we take a closer look at one of the most basic machine learning algorithms. We will train it on the well-known Iris flower dataset (it even has its own Wikipedia page) and use it to predict the species of new Iris flowers from your own measurements.
First of all, we need to have Python installed (this tutorial is written for Python 2.7). Then we need a proper Python IDE or a text editor. I highly recommend Spyder (for Windows) and CodeRunner (for macOS).
We are ready to go. So how does machine learning work? Well, you need a dataset and a classifier. Each dataset has to contain some measurements, i.e., attributes, and a class label for each set of measurements. Each row represents one instance. The classifier, programmatically speaking, is an instance of a class from the scikit-learn library. So we need a dataset, and we need to create an instance of a classifier to train on it.
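To make the layout concrete, here is a minimal sketch of what "attributes" and "labels" look like as plain Python lists. The numbers are a made-up toy dataset, not real Iris measurements, and the names attributes and labels are only chosen for illustration.

```python
# Hypothetical toy dataset, not real Iris measurements.
# Each row is one instance; each column is one attribute.
attributes = [
    [5.0, 2.0],  # instance 0: length, width
    [5.5, 2.2],  # instance 1
    [1.0, 0.5],  # instance 2
]

# One class label per instance, coded as an integer.
labels = [1, 1, 0]

# The two structures have to line up row by row.
for row, label in zip(attributes, labels):
    print("attributes: %s -> label: %s" % (row, label))
```

The Iris dataset used below follows the same idea, only with four attributes per row and three possible labels.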
Iris flower dataset
The Iris flower dataset is probably the most used dataset in scikit-learn tutorials, so why not start with it?
The first assumption about a dataset is that it has a proper data structure. The second assumption is that it “somehow” (that is a topic for another tutorial) represents the population the measurements come from. If your dataset does not vary in the same way as your population, the model will probably not work well because of overfitting: the classifier will predict poorly on data with more variability than it saw during training.
The dataset was presented by Sir Ronald Fisher in 1936 and it contains measurements of Iris flowers: sepal and petal length and width. It also contains the variable target_names, which holds the types of flower: setosa, versicolor and virginica. For computation purposes, the type of each flower is coded as a nominal value 0, 1 or 2 in the variable target.
The following image shows the distribution of the instances in the “tested” population of Iris flowers.
Importing the dataset is easy since it is included in the scikit-learn installation. So we start our script with:
from sklearn.datasets import load_iris
iris = load_iris()
The variable iris now contains the whole dataset. We have to split it up to make it more useful for a machine learning classifier. Let’s split it into attributes and labels:
X = iris.data
Y = iris.target
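Before training, it can help to look at what was just loaded. The following is a small sketch that only inspects the data; print is written with parentheses so that the snippet also runs under Python 3.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # attributes: one row per flower, one column per measurement
Y = iris.target  # labels: one integer class code per flower

print(X.shape)                  # (150, 4): 150 flowers, 4 measurements each
print(Y.shape)                  # (150,): one label per flower
print(iris.feature_names)       # names of the four measurement columns
print(list(iris.target_names))  # species names, indexed by the class code
print(X[0])                     # measurements of the first flower
print(Y[0])                     # class code of the first flower
```

The position of a name in target_names is the class code, so code 0 stands for the first name in that list.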
Classifier and training
Creating and training a classifier is pretty easy thanks to the scikit-learn module. First we need to create an instance of a classifier object. For that we need the class itself, which we get by importing another module into our Python script.
from sklearn import tree
Then we can create an instance of the decision tree classifier like a charm:
clf = tree.DecisionTreeClassifier()
Training is also very straightforward. The fit method takes only two required parameters: the attributes and the labels. We can simply use the data we prepared before:
clf = clf.fit(X, Y)
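As a quick sanity check after training, one possible sketch is to ask the classifier which classes it has learned and how well it reproduces the labels it was trained on. The variable name training_accuracy is only chosen for illustration. Keep in mind that accuracy measured on the training data itself says little about how well new flowers will be classified.

```python
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
X = iris.data
Y = iris.target

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)  # fit() returns the classifier object itself

# Class codes the classifier has seen during training
print(list(clf.classes_))

# Fraction of training instances whose label is reproduced correctly
training_accuracy = clf.score(X, Y)
print(training_accuracy)
```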
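Then comes a little bit of programming, but don’t be afraid, it is easy. A new measurement can be predicted by calling the predict method on the classifier object. Note that predict expects a list of instances, so a single measurement has to be wrapped in an extra pair of brackets:
p = clf.predict([[2, 3, 1, 3]])
print(p)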
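Because predict works on a list of instances, several flowers can be classified in one call. Here is a minimal sketch; the measurement values are illustrative sample inputs in the order sepal length, sepal width, petal length, petal width.

```python
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)

# One flower: a list containing a single list of four measurements
single = clf.predict([[5.1, 3.5, 1.4, 0.2]])
print(single)   # an array holding one class code

# Several flowers at once: one inner list per flower
several = clf.predict([[5.1, 3.5, 1.4, 0.2],
                       [6.3, 3.3, 6.0, 2.5]])
print(several)  # an array holding one class code per flower
```

The result is always an array, even for a single flower, so the class code itself is the first element, for example single[0].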
But as mentioned, all predicted classes are integers, so you have to remember that 0 is setosa, 1 is versicolor and 2 is virginica. If you feel like doing some coding, let’s get into it.
We build a pretty simple function which tells us the predicted Iris flower based on the value predicted by the classifier instance. Its only parameter is an integer from 0 to 2, and it prints out which Iris flower it is. It is pretty handy and easy, so why not.
We declare this function after importing all modules at the beginning of the script.
def whichtarget(predictor):
    # Class codes follow iris.target: 0, 1 and 2
    if predictor == 0:
        print("It is probably Iris Setosa.")
    elif predictor == 1:
        print("It is probably Iris Versicolor.")
    elif predictor == 2:
        print("It is probably Iris Virginica.")
    else:
        print("Unknown predictor.")
And then we can call this function anywhere in our script by simply calling its name with a parameter.
The whole script is as follows:
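As a variation, the species name can also be looked up in iris.target_names instead of being spelled out by hand. The sketch below uses a hypothetical helper called targetname, which is not part of the tutorial script; it returns the name rather than printing it, so the result can be reused elsewhere.

```python
from sklearn.datasets import load_iris

iris = load_iris()

def targetname(code):
    """Return the species name for a class code, or None if the code is unknown."""
    code = int(code)
    if 0 <= code < len(iris.target_names):
        return str(iris.target_names[code])
    return None

# Try every valid code plus one that does not exist
for code in [0, 1, 2, 7]:
    print("%s -> %s" % (code, targetname(code)))
```

Looking the name up in the dataset keeps the codes and the names in sync without repeating them in the function.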
# Very basic machine learning model
# =================================
# Michael Tesar <email@example.com>
# 2016
# It uses a decision tree classifier to model
# the scikit-learn default dataset about Iris
# flowers. It also contains a function which
# prints the name of the predicted flower to
# the console.

# Import libraries
from sklearn import datasets
from sklearn import tree
import numpy as np

# Define function to print which Iris is predicted
# (class codes follow iris.target: 0, 1 and 2)
def whichtarget(predictor):
    if predictor == 0:
        print("It is probably Iris Setosa.")
    elif predictor == 1:
        print("It is probably Iris Versicolor.")
    elif predictor == 2:
        print("It is probably Iris Virginica.")
    else:
        print("Unknown predictor.")

# Load iris dataset
iris = datasets.load_iris()

# Prepare dataset for analysis
X = iris.data
Y = iris.target

# Create instance of classifier
clf = tree.DecisionTreeClassifier()

# Train classifier
clf = clf.fit(X, Y)

# Predict new values (predict expects a list of instances)
p = clf.predict([[2, 3, 1, 3]])

# Print out which flower it is (p holds one class code per instance)
whichtarget(p[0])