Decision Tree from the Scratch

Python algorithm built from the scratch for a simple Decision Tree.

Theory and Math behind this post is covered here in Medium post.

We have just looked at Mathematical working for ID3, this post we will see how to build this in Python from the scratch. We will make it simple by using Pandas dataframes.

Load the prerequisites

In [1]:
import numpy as np
import pandas as pd
eps = np.finfo(float).eps
from numpy import log2 as log

‘eps’ here is the smallest representable number. At times we get log(0) or 0 in the denominator, to avoid that we are going to use this.

The dataset is very small one, we’ll just represent that with a dictionary.

In [3]:
dataset = {'Taste':['Salty','Spicy','Spicy','Spicy','Spicy','Sweet','Salty','Sweet','Spicy','Salty'],

We will read this with Pandas dataframe

In [5]:
df = pd.DataFrame(dataset,columns=['Taste','Temperature','Texture','Eat'])
In [6]:
Taste Temperature Texture Eat
0 Salty Hot Soft No
1 Spicy Hot Soft No
2 Spicy Hot Hard Yes
3 Spicy Cold Hard No
4 Spicy Hot Hard Yes
5 Sweet Cold Soft Yes
6 Salty Cold Soft No
7 Sweet Hot Soft Yes
8 Spicy Cold Soft Yes
9 Salty Hot Hard Yes

We have seen from an earlier post we need to find the Entropy and then Information Gain for splitting the data set.

$S = - \sum\limits_{i=1}^N p_i log_2 p_i$

Here the fraction is $p_i$, it is the proportion of a number of elements in that split group to the number of elements in the group before splitting(parent group).

We’ll define a function that takes in class (target variable vector) and finds the entropy of that class.

In [10]:
entropy_node = 0  #Initialize Entropy
values = df.Eat.unique()  #Unique objects - 'Yes', 'No'
for value in values:
    fraction = df.Eat.value_counts()[value]/len(df.Eat)
    entropy_node += -fraction*np.log2(fraction)

Summation is taken care of by ‘entropy +=’

We can check this by supplying our data frame to this function

In [11]:

This is same as the entropy $(E_o)$ we calculated in the theory post. Now the entropy of the attribute Taste

In [12]:
attribute = 'Taste'
target_variables = df.Eat.unique()  #This gives all 'Yes' and 'No'
variables = df[attribute].unique()    #This gives different features in that attribute (like 'Sweet')
entropy_attribute = 0
for variable in variables:
    entropy_each_feature = 0
    for target_variable in target_variables:
        num = len(df[attribute][df[attribute]==variable][df.Eat ==target_variable]) #numerator
        den = len(df[attribute][df[attribute]==variable])  #denominator
        fraction = num/(den+eps)  #pi
        entropy_each_feature += -fraction*log(fraction+eps) #This calculates entropy for one feature like 'Sweet'
    fraction2 = den/len(df)
    entropy_attribute += -fraction2*entropy_each_feature   #Sums up all the entropy ETaste
In [13]:

This is same as the entropy $E_{Taste}$ we calculated in the theory post

Now the Information Gain is simply

$IG_{Taste} = entropy_{node} — entropy_{attribute} = 0.21$

We will continue this for the other attributes ‘Temperature’ and ‘Texture’.

We just need to replace attribute= ‘Taste’ with ‘Temperature’ and ‘Texture’

We get,

$E_{Temperature} = 0.9509$

$E_{Texture} = 0.9245$

Then Information Gain,

$IG_{Temperature} = 0.02$

$IG_{Texture} = 0.05$

Next process:

We’ll find the winner node, the one with the highest Information Gain. We repeat this process to find which is the attribute we need to consider to split the data at the nodes.

We build a decision tree based on this. Below is the complete code.

In [15]:
def find_entropy(df):
    Class = df.keys()[-1]   #To make the code generic, changing target variable class name
    entropy = 0
    values = df[Class].unique()
    for value in values:
        fraction = df[Class].value_counts()[value]/len(df[Class])
        entropy += -fraction*np.log2(fraction)
    return entropy

def find_entropy_attribute(df,attribute):
  Class = df.keys()[-1]   #To make the code generic, changing target variable class name
  target_variables = df[Class].unique()  #This gives all 'Yes' and 'No'
  variables = df[attribute].unique()    #This gives different features in that attribute (like 'Hot','Cold' in Temperature)
  entropy2 = 0
  for variable in variables:
      entropy = 0
      for target_variable in target_variables:
          num = len(df[attribute][df[attribute]==variable][df[Class] ==target_variable])
          den = len(df[attribute][df[attribute]==variable])
          fraction = num/(den+eps)
          entropy += -fraction*log(fraction+eps)
      fraction2 = den/len(df)
      entropy2 += -fraction2*entropy
  return abs(entropy2)

def find_winner(df):
    Entropy_att = []
    IG = []
    for key in df.keys()[:-1]:
#         Entropy_att.append(find_entropy_attribute(df,key))
    return df.keys()[:-1][np.argmax(IG)]

def get_subtable(df, node,value):
  return df[df[node] == value].reset_index(drop=True)

def buildTree(df,tree=None):
    Class = df.keys()[-1]   #To make the code generic, changing target variable class name

    #Here we build our decision tree

    #Get attribute with maximum information gain
    node = find_winner(df)

    #Get distinct value of that attribute e.g Salary is node and Low,Med and High are values
    attValue = np.unique(df[node])

    #Create an empty dictionary to create tree    
    if tree is None:
        tree[node] = {}

   #We make loop to construct a tree by calling this function recursively. 
    #In this we check if the subset is pure and stops if it is pure. 

    for value in attValue:

        subtable = get_subtable(df,node,value)
        clValue,counts = np.unique(subtable['Eat'],return_counts=True)

        if len(counts)==1:#Checking purity of subset
            tree[node][value] = clValue[0]
            tree[node][value] = buildTree(subtable) #Calling the function recursively 

    return tree

Now we call the buildTree function and print the tree we built.

In [16]:
tree = buildTree(df)
In [17]:
import pprint
{'Taste': {'Salty': {'Texture': {'Hard': 'Yes', 'Soft': 'No'}},
           'Spicy': {'Temperature': {'Cold': {'Texture': {'Hard': 'No',
                                                          'Soft': 'Yes'}},
                                     'Hot': {'Texture': {'Hard': 'Yes',
                                                         'Soft': 'No'}}}},
           'Sweet': 'Yes'}}

You can see the tree structure is formed based on the priority list to split the data.

We can write an algorithm to predict using this tree structure.

In [18]:
def predict(inst,tree):
    #This function is used to predict for any input variable 

    #Recursively we go through the tree that we built earlier

    for nodes in tree.keys():

        value = inst[nodes]
        tree = tree[nodes][value]
        prediction = 0

        if type(tree) is dict:
            prediction = predict(inst, tree)
            prediction = tree

    return prediction


In [19]:
inst = df.iloc[6]  #This takes row with index 6
In [20]:
Taste          Salty
Temperature     Cold
Texture         Soft
Eat               No
Name: 6, dtype: object

We can check with our available data.

We took a row with index 6 from our dataframe to use this for our prediction.

Get prediction

In [21]:
data = {'Taste':'Salty', 'Temperature':'Cold', 'Texture':'Hard'}
In [22]:
inst = pd.Series(data)
In [23]:
prediction = predict(inst, tree)

Our tree predicted kid will eat this food (which is Salty, Cold and Hard).

This is a simple tree building algorithm without much control parameters.

Some of the contents here, I am inspired from this book ‘Ensemble machine learning-Ankit Dixit’. I encourage you to read it for further explorations.

Below cell is for styling, ignore it.

In [24]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))