Random Forest Regression in Python

by Feeba Mahin
November 21, 2022

Table of Contents

  • What is Random Forest?
  • Advantages and disadvantages
  • When to use Random Forest in real life
  • Final Thoughts

Random Forest is an ensemble technique capable of performing both regression and classification tasks using multiple decision trees and a technique called Bootstrap and Aggregation, commonly known as bagging. The basic idea is to combine multiple decision trees in determining the final output rather than relying on any individual decision tree.

Random Forest has multiple decision trees as base learning models. We randomly perform row sampling and feature sampling from the dataset, forming a sample dataset for every model. This part is called Bootstrap.

Every decision tree has high variance, but when we combine all of them in parallel, the resultant variance is low: each decision tree is trained on its own sample of the data, so the output does not depend on one decision tree but on many. In the case of a classification problem, the final output is taken by majority voting; in the case of a regression problem, the final output is the mean of all the outputs. This part is called Aggregation.
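To make the two steps concrete, here is a minimal NumPy sketch; X_train, y_train, and tree_preds are hypothetical arrays standing in for real training data and per-tree predictions:

import numpy as np

rng = np.random.default_rng(42)

# Bootstrap: draw row indices with replacement for one tree's training sample
idx = rng.integers(0, len(X_train), size=len(X_train))
X_sample, y_sample = X_train[idx], y_train[idx]

# Aggregation: for regression, average the per-tree predictions
# tree_preds has shape (n_trees, n_test_points), one row per tree
final_prediction = tree_preds.mean(axis=0)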

 

We approach the Random Forest regression technique like any other machine learning technique; a minimal sketch of the workflow follows the list:

  • Formulate a specific question and identify the data source needed to answer it.
  • Make sure the data is in an accessible format; otherwise, convert it to the required format.
  • Identify any noticeable anomalies and missing data points that need to be handled before modeling.
  • Create a machine learning model.
  • Set the baseline model that you want to beat.
  • Train the machine learning model on the data.
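A compact sketch of that workflow with pandas and sklearn, assuming a hypothetical CSV file data.csv with numeric features and a numeric column named target:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("data.csv")                  # hypothetical dataset
df = df.dropna()                              # simplest fix for missing values
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline to beat: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
print(baseline.score(X_test, y_test), model.score(X_test, y_test))  # R^2 scores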


Definitions

Decision Trees are used for both regression and classification problems. They visually flow like trees, hence the name; in the regression case, they start at the root of the tree and follow splits based on variable outcomes until a leaf node is reached, which gives the result. A basic decision tree starts with a variable, say Var_1, and splits based on specific criteria: when the answer is "yes", the tree follows one path; when "no", it goes down the other. This process repeats until the tree reaches a leaf node and the outcome is decided. The leaf values (a, b, c, or d, say) can represent any numeric or categorical value.
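The original diagram is not reproduced here, but a tiny runnable stand-in shows the same idea; the data is made up for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X = np.array([[1], [2], [3], [10], [11], [12]])   # a single feature, "Var_1"
y = np.array([5.0, 5.5, 6.0, 20.0, 21.0, 22.0])

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Var_1"]))  # prints the yes/no split rules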


What is Random Forest?

Random Forest is a supervised learning algorithm based on ensemble learning and many Decision Trees. Random Forest is a Bagging technique, so all trees are built in parallel, with no interaction between them during construction. RF can be used to solve both Classification and Regression tasks.

The name “Random Forest” comes from the Bagging idea of data randomization (Random) and building multiple Decision Trees (Forest). Overall, it is a powerful ML algorithm that limits the disadvantages of a Decision Tree model.


Random Forest Algorithm

To make things clear, let's take a look at the exact Random Forest algorithm:

  1. You have your original dataset D and you want K Decision Trees in your ensemble. Additionally, you have a number N: each tree is grown until there are N or fewer samples in each node (for the Regression task, N is usually 5). Finally, you have a number F: the number of features randomly selected in each node of a Decision Tree; the feature used to split the node is picked from these F features (for the Regression task, F is usually sqrt(number of features of the original dataset D)).
  2. Everything else is rather simple. Random Forest creates K subsets of the data from the original dataset D by sampling with replacement. The samples that do not appear in a given subset are called that subset's "out-of-bag" samples.
  3. Each of the K trees is built using a single subset only, grown until there are N or fewer samples in each node. In each node, F features are randomly selected, and one of them is used to split the node.
  4. The K trained models form an ensemble, and the final result for the Regression task is produced by averaging the predictions of the individual trees, as sketched below.
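A from-scratch sketch of steps 1 to 4 (K, N, and F as defined above), using sklearn's DecisionTreeRegressor as the base learner and NumPy arrays as inputs; this illustrates the algorithm and is not sklearn's internal implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_random_forest(X, y, K=100, N=5, F=None, seed=42):
    rng = np.random.default_rng(seed)
    F = F or max(1, int(np.sqrt(X.shape[1])))       # features tried at each split
    trees = []
    for _ in range(K):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap subset of D
        # min_samples_leaf approximates "grow until N or fewer samples per node"
        tree = DecisionTreeRegressor(min_samples_leaf=N, max_features=F)
        trees.append(tree.fit(X[idx], y[idx]))      # one tree per subset
    return trees

def predict_random_forest(trees, X):
    # Regression result: average of the individual trees' predictions
    return np.mean([t.predict(X) for t in trees], axis=0)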


Advantages and disadvantages

To start with, let's talk about the advantages. Random Forest is built on Bagging, a technique that generally improves an algorithm's performance, and Random Forest is no exception. It works well "out of the box" with no hyperparameter tuning and considerably better than linear algorithms, which makes it a good default option. Moreover, Random Forest is fast, robust, and can report feature importances, which can be quite useful.

Also, Random Forest limits the greatest disadvantage of Decision Trees: it almost never overfits, thanks to subset and feature randomization. Firstly, it uses a unique subset of the initial data for every base model, which makes the Decision Trees less correlated. Secondly, it splits each node in every Decision Tree using a randomly selected set of features. Such an approach means that no single tree sees all the data, which helps the ensemble focus on the general patterns in the training data and reduces its sensitivity to noise.

Nevertheless, Random Forest has disadvantages. Despite being an improvement over a single Decision Tree, more complex techniques exist, and in practice the best prediction accuracy on difficult problems is usually obtained by Boosting algorithms.

Also, Random Forest cannot extrapolate beyond its training data: its predictions always stay within the range of the target values seen in the training set. This is a major disadvantage, as not every Regression problem can be solved with Random Forest; the Random Forest Regressor cannot discover trends that would let it predict values outside the training range. This is one reason Random Forest is used mostly for Classification tasks.
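You can see this for yourself on made-up linear data: the forest's prediction stays capped near the largest target it saw during training:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(100).reshape(-1, 1)     # training inputs cover 0..99
y = 2.0 * X.ravel()                   # perfectly linear target, max value 198

rf = RandomForestRegressor(random_state=42).fit(X, y)
print(rf.predict([[200]]))            # roughly 198, not the true 400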

When to use Random Forest in real life

As mentioned above, Random Forest is used mostly to solve Classification problems. It is worth noting that Random Forest is rarely used in production, simply because other algorithms show better performance. However, RF is a must-have algorithm for hypothesis testing, as it may help you get valuable insights. For example, an "out-of-the-box" Random Forest model was good enough to outperform a complex multi-model neural network on a difficult Fraud Detection task.

From my experience, you might want to try Random Forest as your ML Classification algorithm for problems such as:

  1. Fraud Detection (Classification) – please refer to the article linked above. You may find it quite thrilling, as it shows how simple ML models can beat complex neural networks on a non-obvious task.
  2. Credit Scoring (Classification) – an important solution in the banking sector. Some banks build enormous neural networks for this task; however, simpler approaches can give the same result.
  3. E-commerce (Classification) – for example, predicting whether a customer will like a product or not.
  4. Any Classification problem with tabular data, for example, Kaggle competitions.

In the Regression case, you should use Random Forest if:

  1. It is not a time series problem
  2. The data has a non-linear trend and extrapolation is not crucial

For example, Random Forest is frequently used in value prediction, such as the value of a house or of a packet of milk from a new brand.


How to use Random Forest for Regression

Setting up

As mentioned above, it is quite easy to use Random Forest. Fortunately, the sklearn library implements the algorithm for both the Regression and Classification tasks: use the RandomForestRegressor model for Regression problems and RandomForestClassifier for Classification tasks.
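The corresponding imports, for reference:

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

regressor = RandomForestRegressor()    # for Regression tasks
classifier = RandomForestClassifier()  # for Classification tasks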

Training

If you have ever trained an ML model with sklearn, you will have no difficulty working with RandomForestRegressor: all you need to do is call the fit method on your training set and the predict method on the test set.

from sklearn.ensemble import RandomForestRegressor

random_forest = RandomForestRegressor(random_state=42)  # any fixed seed
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)

However, Random Forest in sklearn does not handle missing values automatically: the algorithm raises an error if it finds NaN or null values in your data. If you want to check this for yourself, please refer to the "Missing values" section of the notebook. Of course, you can simply drop all samples with missing values and continue training; still, there are some non-standard techniques that will help you overcome this problem.

Overall, please do not forget about EDA. It is always better to study your data, normalize it, and handle the categorical features and missing values before you even start training; that way you will avoid many obstacles.
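A minimal pre-training checklist in pandas, again assuming a hypothetical dataset data.csv:

import pandas as pd

df = pd.read_csv("data.csv")      # hypothetical dataset
print(df.isna().sum())            # count missing values per column
df = df.dropna()                  # or impute them instead of dropping
df = pd.get_dummies(df)           # one-hot encode categorical features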

Tuning

In general, you should always tune your model, as tuning usually enhances an algorithm's performance. As you might know, tuning is a time-expensive process, and with a Random Forest it gets even worse, because hundreds of trees are trained over and over, once per combination in the parameter grid. Still, do not be put off: trust me, it is worth it.

You can easily tune a RandomForestRegressor model using GridSearchCV. If you are not sure which hyperparameters to put in your parameter grid, refer either to the official sklearn documentation or to Kaggle notebooks. The sklearn documentation lists the hyperparameters RandomForestRegressor accepts; Kaggle notebooks, on the other hand, feature other users' parameter grids, which may be quite helpful.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

SEED = 42  # any fixed seed for reproducibility

random_forest_tuning = RandomForestRegressor(random_state=SEED)
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_features': ['sqrt', 'log2'],                  # 'auto' was removed in recent sklearn
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['squared_error', 'absolute_error'],  # formerly 'mse' and 'mae'
}
GSCV = GridSearchCV(estimator=random_forest_tuning, param_grid=param_grid, cv=5)
GSCV.fit(X_train, y_train)
GSCV.best_params_
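After the search finishes, GridSearchCV keeps the best model refit on the whole training set, so you can use it directly:

best_rf = GSCV.best_estimator_        # refit with the best parameter combination
y_pred = best_rf.predict(X_test)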

Testing

When you have your model trained and tuned, it is time to test its final performance. Random Forest is just another Regression algorithm, so you can use all the regression metrics to assess its result.

For example, you might use MAE, MSE, MASE, RMSE, MAPE, SMAPE, and others. However, from my experience, MAE and MSE are the most commonly used, and both are a good fit for evaluating the model's performance. If you use them, keep in mind that the lower the error the better, and that the error of a perfect model is zero.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

SEED = 42

random_forest = RandomForestRegressor(random_state=SEED)
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
print('MAE: ', mean_absolute_error(y_test, y_pred))
print('MSE: ', mean_squared_error(y_test, y_pred))

Also, it is worth mentioning that you might not need a separate Cross-Validation technique to check the model's ability to generalize. Some Data Scientists argue that Random Forest comes with free Cross-Validation of a sort: because it randomizes both the samples each tree sees and the features considered at each split, it does not overfit the way many other models do, so running Cross-Validation on a Random Forest model may be unnecessary.

Still, if you want something like a Cross-Validation check, you can use the hold-out set concept. As mentioned before, the samples from the original dataset that do not appear in a given tree's bootstrap subset are called its "out-of-bag" samples, and they are a perfect fit for a hold-out set. Generally, evaluating on the "out-of-bag" samples is enough to understand whether your model generalizes well.
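sklearn exposes exactly this: pass oob_score=True and the fitted regressor reports the R² score measured on each tree's out-of-bag samples:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(oob_score=True, random_state=42)
rf.fit(X_train, y_train)
print(rf.oob_score_)              # R^2 estimated on the out-of-bag samples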


Final Thoughts

To summarize, we started with some theoretical background on Ensemble Learning, Bagging, and the Random Forest algorithm, then went through a step-by-step guide on using Random Forest in Python for the Regression task. We also discussed when Random Forest is preferable to other ML algorithms and covered some tips you may find useful when working with it.
