Introduction to Random Forest Algorithm with Python

Mohammed AL-Ma'amari · Published in Good Audience · Sep 25, 2018

The random forest algorithm has become one of the most common algorithms used in ML competitions such as those on Kaggle. If you ever search for an easy-to-use and accurate ML algorithm, you will almost certainly find random forest among the top results. To understand the random forest algorithm, you first have to be familiar with decision trees.

What are Decision Trees?

  • Decision trees are predictive models that use a set of binary rules to calculate a target value.
  • There are two types of decision trees: classification trees and regression trees.
  • Classification trees are used to predict categorical targets, such as land cover classes.
  • Regression trees are used to predict continuous targets, such as biomass or percent tree cover.
  • Each individual tree is a fairly simple model that has branches, nodes and leaves.
  • The nodes contain the attributes the objective function depends on (a minimal example follows this list).
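
For a concrete picture, here is a minimal sketch (not from the original post) that fits a single classification tree with scikit-learn on a tiny toy dataset:

from sklearn.tree import DecisionTreeClassifier

# Toy data: two binary features per sample, binary target
X_toy = [[0, 0], [1, 1], [1, 0], [0, 1]]
y_toy = [0, 1, 1, 0]

# A shallow tree is enough to learn the rule "target = first feature"
toy_tree = DecisionTreeClassifier(max_depth=2)
toy_tree.fit(X_toy, y_toy)
print(toy_tree.predict([[1, 1]]))  # [1]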

Now that you are familiar with decision trees, you are ready to understand random forests.

What is Random Forest?

As Leo Breiman defined it in his research paper: “Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.”

Another definition from the same paper: “A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, …} where the {Θk} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x.” Briefly, a random forest builds multiple decision trees and merges their predictions to get a more accurate and stable prediction.
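
To make that definition concrete, here is a rough sketch of the core idea (my own illustration, not code from Breiman's paper): train several trees on bootstrap samples, let each tree cast a vote, and return the majority class.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tiny_forest_predict(X_train, y_train, X_new, n_trees=5, seed=0):
    # X_train, y_train are NumPy arrays with binary labels 0/1
    rng = np.random.RandomState(seed)
    votes = []
    for _ in range(n_trees):
        # Bootstrap sample: draw rows with replacement
        idx = rng.randint(0, len(X_train), len(X_train))
        # Each tree also considers a random subset of features at every split
        tree = DecisionTreeClassifier(max_features='sqrt', random_state=rng)
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_new))
    # Each tree casts one unit vote; the majority class wins
    return (np.mean(votes, axis=0) > 0.5).astype(int)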

Advantages of Random Forests

  • It can be used for both classification and regression problems
  • Reduction in overfitting: by averaging several trees, there is a significantly lower risk of overfitting.
  • Random forests make a wrong prediction only when more than half of the base classifiers are wrong
  • It is very easy to measure the relative importance of each feature for the prediction; scikit-learn, for example, exposes this directly on a fitted model (see the short sketch after this list).

Because of this, random forests are often more accurate than many other off-the-shelf algorithms.
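
Here is a minimal, self-contained sketch of that feature-importance measurement (synthetic data, separate from the Titanic example below):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 200 samples, 5 features, binary target
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)
print(forest.feature_importances_)  # one importance score per feature, summing to 1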

Disadvantages of Random Forests

  • Random forests have been observed to overfit for some datasets with noisy classification/regression tasks.
  • It is more complex and computationally expensive than a single decision tree.

Important Terminology related to Decision Trees [1]

Let’s look at the basic terminology used with decision trees and random forests :

  1. Root Node: It represents the entire population or sample, and it further gets divided into two or more homogeneous sets.
  2. Splitting: The process of dividing a node into two or more sub-nodes.
  3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.
  5. Pruning: Removing sub-nodes of a decision node is called pruning. You can think of it as the opposite of splitting.
  6. Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
  7. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are its children.

Now that we know some essentials about random forests, let us apply the algorithm to a dataset. In our case we will use Kaggle's Titanic survivors dataset, which I preprocessed beforehand.

And then we will use a neural network to compare the results.

I recommend that you try running the code yourself in this [Colab Notebook].

Import needed dependencies :

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.callbacks import ModelCheckpoint

Load the preprocessed dataset:

Download the preprocessed dataset [Here].

dataset = pd.read_csv('TitanicPreprocessed.csv')
dataset.head()

y = dataset['Survived']
X = dataset.drop(['Survived'], axis=1)

# Split the dataset into train and test data
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)
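
An optional sanity check (not in the original post; the exact numbers depend on the preprocessed CSV) to confirm the split and the class balance:

print(train_X.shape, test_X.shape)           # shapes of the train and test splits
print(train_y.value_counts(normalize=True))  # fraction of survivors vs. non-survivors in the training split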

Set the parameters for the random forest model :

parameters = {'bootstrap': True,
              'min_samples_leaf': 3,
              'n_estimators': 50,
              'min_samples_split': 10,
              'max_features': 'sqrt',
              'max_depth': 6,
              'max_leaf_nodes': None}

Hyperparameters of the sklearn RandomForestClassifier [2] :

bootstrap : boolean, optional (default=True)

  • Whether bootstrap samples are used when building trees.

min_samples_leaf : int, float, optional (default=1)

The minimum number of samples required to be at a leaf node:

  • If int, then consider min_samples_leaf as the minimum number.
  • If float, then min_samples_leaf is a percentage and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

n_estimators : integer, optional (default=10)

  • The number of trees in the forest.

min_samples_split : int, float, optional (default=2)

The minimum number of samples required to split an internal node:

  • If int, then consider min_samples_split as the minimum number.
  • If float, then min_samples_split is a percentage and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

max_features : int, float, string or None, optional (default=”auto”)

The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.
  • If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
  • If “auto”, then max_features=sqrt(n_features).
  • If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
  • If “log2”, then max_features=log2(n_features).
  • If None, then max_features=n_features.
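
As a quick illustration (assuming, as the input_dim used later suggests, that our preprocessed Titanic data has 68 feature columns), here is how many features each setting would consider per split:

import numpy as np

n_features = 68                    # assumed width of the preprocessed Titanic features
print(int(np.sqrt(n_features)))    # 'sqrt' / 'auto' -> 8 features per split
print(int(np.log2(n_features)))    # 'log2'          -> 6 features per split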

max_depth : integer or None, optional (default=None)

  • The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

max_leaf_nodes : int or None, optional (default=None)

  • Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

If you want to learn more about the rest of the hyperparameters, check out sklearn.ensemble.RandomForestClassifier.

Define the model :

RF_model = RandomForestClassifier(**parameters)

Train the model :

RF_model.fit(train_X, train_y)

Test the trained model on test data :

RF_predictions = RF_model.predict(test_X)
score = accuracy_score(test_y, RF_predictions)
print(score)

0.82511

We see that the model’s accuracy is about 82%, which is not bad at all.
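
Since f1_score was already imported, we can also check the F1 score (a quick addition, not in the original post); this is useful because the Titanic classes are somewhat imbalanced:

rf_f1 = f1_score(test_y, RF_predictions)
print(rf_f1)  # harmonic mean of precision and recall for the 'Survived' class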

Using Neural Networks:

Define the model :

# Build a fully connected neural network :
NN_model = Sequential()
NN_model.add(Dense(128, input_dim=68, activation='relu'))  # 68 = number of feature columns
NN_model.add(Dense(256, activation='relu'))
NN_model.add(Dense(256, activation='relu'))
NN_model.add(Dense(256, activation='relu'))
NN_model.add(Dense(1, activation='sigmoid'))               # single sigmoid unit for binary survival
NN_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
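
Optionally, you can inspect the architecture and parameter counts with Keras's model summary:

NN_model.summary()  # prints each layer's output shape and number of trainable parameters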

Define a checkpoint callback :

checkpoint_name = 'Weights-{epoch:03d}-{val_acc:.5f}.hdf5' 
checkpoint = ModelCheckpoint(checkpoint_name, monitor='val_acc', verbose = 1, save_best_only = True, mode ='max')
callbacks_list = [checkpoint]

Train the model :

NN_model.fit(train_X, train_y, epochs=150, batch_size=64, validation_split = 0.2, callbacks=callbacks_list)

The training process :

Epoch 00044: val_acc did not improve from 0.88060
Epoch 45/150
534/534 [==============================] - 0s 149us/step - loss: 0.3196 - acc: 0.8652 - val_loss: 0.4231 - val_acc: 0.8433

Epoch 00045: val_acc did not improve from 0.88060
Epoch 46/150
534/534 [==============================] - 0s 134us/step - loss: 0.3156 - acc: 0.8670 - val_loss: 0.4175 - val_acc: 0.8358

Epoch 00046: val_acc did not improve from 0.88060
Epoch 47/150
534/534 [==============================] - 0s 144us/step - loss: 0.3031 - acc: 0.8689 - val_loss: 0.4214 - val_acc: 0.8433

Epoch 00047: val_acc did not improve from 0.88060
Epoch 48/150
534/534 [==============================] - 0s 131us/step - loss: 0.3117 - acc: 0.8689 - val_loss: 0.4095 - val_acc: 0.8582
.
.
Epoch 00148: val_acc did not improve from 0.88060
Epoch 149/150
534/534 [==============================] - 0s 146us/step - loss: 0.1599 - acc: 0.9382 - val_loss: 1.0482 - val_acc: 0.7761

Epoch 00149: val_acc did not improve from 0.88060
Epoch 150/150
534/534 [==============================] - 0s 133us/step - loss: 0.1612 - acc: 0.9307 - val_loss: 1.1589 - val_acc: 0.7836

Epoch 00150: val_acc did not improve from 0.88060
<keras.callbacks.History at 0x7f47cb549320>

Load the weights file of the best model :

weights_file = './Weights-016-0.88060.hdf5'  # choose the best checkpoint
NN_model.load_weights(weights_file)          # load it
NN_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Test the trained model on test data :

predictions = NN_model.predict(test_X)
# Round the sigmoid outputs to 0/1 class labels
rounded = [round(x[0]) for x in predictions]
predictions = rounded
score = accuracy_score(test_y, predictions)
print(score)

0.81165

The accuracy of this neural network model is about 81%, so we notice that the random forest gives us slightly higher accuracy on the test set.

To recap:

  • We learned some essentials about decision trees and random forests.
  • We discussed the advantages and disadvantages of using random forests.
  • We talked about some important terminologies related to decision trees and random forests.
  • We applied both random forest algorithm and neural networks to a dataset, and we compared the accuracy of the two models.
  • The random forest outscored the neural network on the problem of predicting Titanic survivors.
