Random Forest Regression in Python

by Feeba Mahin
November 21, 2022

Table of Contents

  • What is Random Forest?
  • Advantages and disadvantages
  • When to use Random Forest in real life
  • Final Thoughts

Random Forest is an ensemble technique capable of performing both regression and classification tasks using multiple decision trees and a technique called Bootstrap and Aggregation, commonly known as bagging. The basic idea is to combine multiple decision trees in determining the final output rather than relying on any individual decision tree.

Random Forest has multiple decision trees as base learning models. We randomly perform row sampling and feature sampling from the dataset, forming a sample dataset for every model. This part is called Bootstrap.

Every decision tree has high variance, but when we combine all of them in parallel, the resultant variance is low: each decision tree is trained on its own sample of the data, so the output does not depend on one decision tree but on many. In the case of a classification problem, the final output is taken by majority voting; in the case of a regression problem, the final output is the mean of all the outputs. This part is called Aggregation.
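To make the two steps concrete, here is a minimal NumPy sketch; X_train, y_train, and tree_preds are hypothetical arrays standing in for real training data and per-tree predictions:

import numpy as np

rng = np.random.default_rng(42)

# Bootstrap: draw row indices with replacement for one tree's training sample
idx = rng.integers(0, len(X_train), size=len(X_train))
X_sample, y_sample = X_train[idx], y_train[idx]

# Aggregation: for regression, average the per-tree predictions
# tree_preds has shape (n_trees, n_test_points), one row per tree
final_prediction = tree_preds.mean(axis=0)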

 

We approach the Random Forest regression technique like any other machine learning technique; a minimal sketch of the workflow follows the list:

  • Formulate a specific question and identify the data source needed to answer it.
  • Make sure the data is in an accessible format; otherwise, convert it to the required format.
  • Identify any noticeable anomalies and missing data points that need to be handled before modeling.
  • Create a machine learning model.
  • Set the baseline model that you want to beat.
  • Train the machine learning model on the data.
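A compact sketch of that workflow with pandas and sklearn, assuming a hypothetical CSV file data.csv with numeric features and a numeric column named target:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("data.csv")                  # hypothetical dataset
df = df.dropna()                              # simplest fix for missing values
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline to beat: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
print(baseline.score(X_test, y_test), model.score(X_test, y_test))  # R^2 scores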


Definitions

Decision Trees are used for both regression and classification problems. They visually flow like trees, hence the name; in the regression case, they start at the root of the tree and follow splits based on variable outcomes until a leaf node is reached, which gives the result. A basic decision tree starts with a variable, say Var_1, and splits based on specific criteria: when the answer is "yes", the tree follows one path; when "no", it goes down the other. This process repeats until the tree reaches a leaf node and the outcome is decided. The leaf values (a, b, c, or d, say) can represent any numeric or categorical value.
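The original diagram is not reproduced here, but a tiny runnable stand-in shows the same idea; the data is made up for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X = np.array([[1], [2], [3], [10], [11], [12]])   # a single feature, "Var_1"
y = np.array([5.0, 5.5, 6.0, 20.0, 21.0, 22.0])

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Var_1"]))  # prints the yes/no split rules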


What is Random Forest?

Random Forest is a supervised learning algorithm based on ensemble learning and many Decision Trees. Random Forest is a Bagging technique, so all trees are built in parallel, with no interaction between them during construction. RF can be used to solve both Classification and Regression tasks.

The name “Random Forest” comes from the Bagging idea of data randomization (Random) and building multiple Decision Trees (Forest). Overall, it is a powerful ML algorithm that limits the disadvantages of a Decision Tree model.


Random Forest Algorithm

To make things clear, let's take a look at the exact Random Forest algorithm:

  1. You have your original dataset D and you want K Decision Trees in your ensemble. Additionally, you have a number N: each tree is grown until there are N or fewer samples in each node (for the Regression task, N is usually 5). Finally, you have a number F: the number of features randomly selected in each node of a Decision Tree; the feature used to split the node is picked from these F features (for the Regression task, F is usually sqrt(number of features of the original dataset D)).
  2. Everything else is rather simple. Random Forest creates K subsets of the data from the original dataset D by sampling with replacement. The samples that do not appear in a given subset are called that subset's "out-of-bag" samples.
  3. Each of the K trees is built using a single subset only, grown until there are N or fewer samples in each node. In each node, F features are randomly selected, and one of them is used to split the node.
  4. The K trained models form an ensemble, and the final result for the Regression task is produced by averaging the predictions of the individual trees, as sketched below.
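A from-scratch sketch of steps 1 to 4 (K, N, and F as defined above), using sklearn's DecisionTreeRegressor as the base learner and NumPy arrays as inputs; this illustrates the algorithm and is not sklearn's internal implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_random_forest(X, y, K=100, N=5, F=None, seed=42):
    rng = np.random.default_rng(seed)
    F = F or max(1, int(np.sqrt(X.shape[1])))       # features tried at each split
    trees = []
    for _ in range(K):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap subset of D
        # min_samples_leaf approximates "grow until N or fewer samples per node"
        tree = DecisionTreeRegressor(min_samples_leaf=N, max_features=F)
        trees.append(tree.fit(X[idx], y[idx]))      # one tree per subset
    return trees

def predict_random_forest(trees, X):
    # Regression result: average of the individual trees' predictions
    return np.mean([t.predict(X) for t in trees], axis=0)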


Advantages and disadvantages

To start with, let's talk about the advantages. Random Forest is built on Bagging, a technique that generally improves an algorithm's performance, and Random Forest is no exception. It works well "out of the box" with no hyperparameter tuning and considerably better than linear algorithms, which makes it a good default option. Moreover, Random Forest is fast, robust, and can report feature importances, which can be quite useful.

Also, Random Forest limits the greatest disadvantage of Decision Trees: it almost never overfits, thanks to subset and feature randomization. Firstly, it uses a unique subset of the initial data for every base model, which makes the Decision Trees less correlated. Secondly, it splits each node in every Decision Tree using a randomly selected set of features. Such an approach means that no single tree sees all the data, which helps the ensemble focus on the general patterns in the training data and reduces its sensitivity to noise.

Nevertheless, Random Forest has disadvantages. Despite being an improvement over a single Decision Tree, more complex techniques exist, and in practice the best prediction accuracy on difficult problems is usually obtained by Boosting algorithms.

Also, Random Forest cannot extrapolate beyond its training data: its predictions always stay within the range of the target values seen in the training set. This is a major disadvantage, as not every Regression problem can be solved with Random Forest; the Random Forest Regressor cannot discover trends that would let it predict values outside the training range. This is one reason Random Forest is used mostly for Classification tasks.
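You can see this for yourself on made-up linear data: the forest's prediction stays capped near the largest target it saw during training:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(100).reshape(-1, 1)     # training inputs cover 0..99
y = 2.0 * X.ravel()                   # perfectly linear target, max value 198

rf = RandomForestRegressor(random_state=42).fit(X, y)
print(rf.predict([[200]]))            # roughly 198, not the true 400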

When to use Random Forest in real life

As mentioned above, Random Forest is used mostly to solve Classification problems. It is worth noting that Random Forest is rarely used in production, simply because other algorithms show better performance. However, RF is a must-have algorithm for hypothesis testing, as it may help you get valuable insights. For example, an "out-of-the-box" Random Forest model was good enough to outperform a complex multi-model neural network on a difficult Fraud Detection task.

From my experience, you might want to try Random Forest as your ML Classification algorithm for problems such as:

  1. Fraud Detection (Classification) – please refer to the article linked above. You may find it quite thrilling, as it shows how simple ML models can beat complex neural networks on a non-obvious task.
  2. Credit Scoring (Classification) – an important solution in the banking sector. Some banks build enormous neural networks for this task; however, simpler approaches can give the same result.
  3. E-commerce (Classification) – for example, predicting whether a customer will like a product or not.
  4. Any Classification problem with tabular data, for example, Kaggle competitions.

In the Regression case, you should use Random Forest if:

  1. It is not a time series problem
  2. The data has a non-linear trend and extrapolation is not crucial

For example, Random Forest is frequently used in value prediction, such as the value of a house or of a packet of milk from a new brand.


How to use Random Forest for Regression

Setting up

As mentioned above, it is quite easy to use Random Forest. Fortunately, the sklearn library implements the algorithm for both the Regression and Classification tasks: use the RandomForestRegressor model for Regression problems and RandomForestClassifier for Classification tasks.
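The corresponding imports, for reference:

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

regressor = RandomForestRegressor()    # for Regression tasks
classifier = RandomForestClassifier()  # for Classification tasks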

Training

If you have ever trained an ML model with sklearn, you will have no difficulty working with RandomForestRegressor: all you need to do is call the fit method on your training set and the predict method on the test set.

from sklearn.ensemble import RandomForestRegressor

random_forest = RandomForestRegressor(random_state=42)  # any fixed seed
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)

However, Random Forest in sklearn does not handle missing values automatically: the algorithm raises an error if it finds NaN or null values in your data. If you want to check this for yourself, please refer to the "Missing values" section of the notebook. Of course, you can simply drop all samples with missing values and continue training; still, there are some non-standard techniques that will help you overcome this problem.

Overall, please do not forget about EDA. It is always better to study your data, normalize it, and handle the categorical features and missing values before you even start training; that way you will avoid many obstacles.
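A minimal pre-training checklist in pandas, again assuming a hypothetical dataset data.csv:

import pandas as pd

df = pd.read_csv("data.csv")      # hypothetical dataset
print(df.isna().sum())            # count missing values per column
df = df.dropna()                  # or impute them instead of dropping
df = pd.get_dummies(df)           # one-hot encode categorical features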

Tuning

In general, you should always tune your model, as tuning usually enhances an algorithm's performance. As you might know, tuning is a time-expensive process, and with a Random Forest it gets even worse, because hundreds of trees are trained over and over, once per combination in the parameter grid. Still, do not be put off: trust me, it is worth it.

You can easily tune a RandomForestRegressor model using GridSearchCV. If you are not sure which hyperparameters to put in your parameter grid, refer either to the official sklearn documentation or to Kaggle notebooks. The sklearn documentation lists the hyperparameters RandomForestRegressor accepts; Kaggle notebooks, on the other hand, feature other users' parameter grids, which may be quite helpful.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

SEED = 42  # any fixed seed for reproducibility

random_forest_tuning = RandomForestRegressor(random_state=SEED)
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_features': ['sqrt', 'log2'],                  # 'auto' was removed in recent sklearn
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['squared_error', 'absolute_error'],  # formerly 'mse' and 'mae'
}
GSCV = GridSearchCV(estimator=random_forest_tuning, param_grid=param_grid, cv=5)
GSCV.fit(X_train, y_train)
GSCV.best_params_
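After the search finishes, GridSearchCV keeps the best model refit on the whole training set, so you can use it directly:

best_rf = GSCV.best_estimator_        # refit with the best parameter combination
y_pred = best_rf.predict(X_test)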

Testing

When you have your model trained and tuned, it is time to test its final performance. Random Forest is just another Regression algorithm, so you can use all the regression metrics to assess its result.

For example, you might use MAE, MSE, MASE, RMSE, MAPE, SMAPE, and others. However, from my experience, MAE and MSE are the most commonly used, and both are a good fit for evaluating the model's performance. If you use them, keep in mind that the lower the error the better, and that the error of a perfect model is zero.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

SEED = 42

random_forest = RandomForestRegressor(random_state=SEED)
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
print('MAE: ', mean_absolute_error(y_test, y_pred))
print('MSE: ', mean_squared_error(y_test, y_pred))

Also, it is worth mentioning that you might not need a separate Cross-Validation technique to check the model's ability to generalize. Some Data Scientists argue that Random Forest comes with free Cross-Validation of a sort: because it randomizes both the samples each tree sees and the features considered at each split, it does not overfit the way many other models do, so running Cross-Validation on a Random Forest model may be unnecessary.

Still, if you want something like a Cross-Validation check, you can use the hold-out set concept. As mentioned before, the samples from the original dataset that do not appear in a given tree's bootstrap subset are called its "out-of-bag" samples, and they are a perfect fit for a hold-out set. Generally, evaluating on the "out-of-bag" samples is enough to understand whether your model generalizes well.
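sklearn exposes exactly this: pass oob_score=True and the fitted regressor reports the R² score measured on each tree's out-of-bag samples:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(oob_score=True, random_state=42)
rf.fit(X_train, y_train)
print(rf.oob_score_)              # R^2 estimated on the out-of-bag samples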


Final Thoughts

To summarize, we started with some theoretical background on Ensemble Learning, Bagging, and the Random Forest algorithm, then went through a step-by-step guide on using Random Forest in Python for the Regression task. We also discussed when Random Forest is preferable to other ML algorithms and covered some tips you may find useful when working with it.
