How to Avoid Overfitting When Using a Random Forest
In the world of machine learning, one common pitfall is overfitting. Random forests, in particular, have been known to suffer from this issue. Fortunately, it’s possible to avoid this. In this article, we will discuss different ways to avoid overfitting when using a random forest.
A random forest is an ensemble machine learning technique in which a collection of decision trees is built from different subsets of the data. In other words, the ensemble creates a number of prediction models using some or all of your available information and then uses their combined predictions (via voting for classification, or averaging for regression) to make a final prediction. This approach can lead to higher accuracy than a single decision tree, depending on the structure and properties of the data and the individual trees.
How does it work?
Random forests are created by drawing samples from the training dataset to grow multiple decision trees, with each tree trained on a different sample of the rows and, at each split, a random subset of the features. Once trained, each tree makes a prediction on new data, and the final output is a combination of all those predictions.
Random forests rely on bagging (bootstrap aggregating), an ensemble method in which each tree in the forest is grown from a bootstrap sample (i.e., rows drawn randomly with replacement) of the training dataset, meaning that the trees are trained on slightly different datasets than the original. The resulting differences between the trees add variability to the ensemble, which helps reduce the risk of overfitting.
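To make this concrete, here is a minimal sketch of the idea, assuming scikit-learn and a synthetic dataset (neither is mentioned in the article): each tree is grown on a bootstrap sample of the rows, a random subset of features is considered at each split, and the final prediction combines the votes of all trees.

```python
# Minimal random forest sketch using scikit-learn (an assumption; the article
# does not name a library). Bootstrap sampling and per-split feature
# subsampling are what make the trees differ from one another.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    bootstrap=True,       # each tree sees a bootstrap sample of the rows
    max_features="sqrt",  # each split considers a random subset of features
    random_state=42,
)
forest.fit(X, y)

# The final prediction is the majority vote of all 100 trees.
print(forest.predict(X[:5]))
```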
Why do random forests suffer from overfitting?
Compared with a single decision tree, the complexity of a random forest is driven mainly by its features: how many there are and how strongly they relate to the target. If a feature is strongly correlated with the dependent variable, the trees can keep splitting on it into many narrow bins, and those fine-grained splits end up fitting noise in the training data rather than a pattern that generalizes.
Overfitting tends to appear when there is a very strong correlation between the dependent variable and some feature. It can be caused by:
Sparse features: if a feature takes only a few distinct values, it is easy to fit this predictor to the training observations rather than to future ones. For example, when we analyze the survival probability of patients with a given cancer type, if all patients are combined into one bin there is no way to identify subgroups with different survival times.
Overfitting by correlated variables: it is also very easy to overfit when we have two strongly correlated predictors, for example education and income. If we want to predict the average annual income of people with a college degree, a feature such as high-school diplomas mostly carries the same signal, yet the model may still assign it spurious importance (see the correlation check after this list).
Faulty concept: perhaps you are using the wrong dependent variable, or a concept that changes over time (for example, in marketing, targeting people who like soccer). In this case your model only fits old observations and cannot generalize to new data points. Even so, this is usually a less harmful failure mode than overfitting to random noise.
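As a quick way to spot the correlated-variable problem described above, here is a hedged sketch; the column names, the synthetic data, and the 0.9 cutoff are illustrative assumptions, not taken from the article.

```python
# Illustrative sketch: generate two strongly correlated columns and flag them.
# Column names and the 0.9 threshold are assumptions made for this example.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
education_years = rng.normal(14, 2, size=500)
income = education_years * 3000 + rng.normal(0, 1000, size=500)  # tracks education closely
age = rng.normal(40, 10, size=500)

df = pd.DataFrame({"education_years": education_years, "income": income, "age": age})

corr = df.corr().abs()
threshold = 0.9
for i, col_a in enumerate(corr.columns):
    for col_b in corr.columns[i + 1:]:
        if corr.loc[col_a, col_b] > threshold:
            print(f"{col_a} and {col_b} are strongly correlated ({corr.loc[col_a, col_b]:.2f})")
```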
Check that you have a good train/test split and that your estimators are unbiased. For example, if your trees are overfitting, try reducing their number or limiting their depth; if specific features are driving the overfitting, remove them.
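A minimal sketch of that check, again assuming scikit-learn and synthetic data: compare train and test accuracy, and see whether constraining tree depth (one of several possible complexity controls) narrows the gap.

```python
# Assumed setup: scikit-learn with a synthetic dataset. A large gap between
# train and test accuracy is a sign of overfitting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for max_depth in (None, 10, 5):  # None lets the trees grow fully
    forest = RandomForestClassifier(n_estimators=100, max_depth=max_depth, random_state=0)
    forest.fit(X_train, y_train)
    print(f"max_depth={max_depth}: "
          f"train={forest.score(X_train, y_train):.3f}, "
          f"test={forest.score(X_test, y_test):.3f}")
```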
Overfitting is also related to ensemble learning: with an ensemble, we want the combined model to do better than any individual model on its own. The random forest algorithm tries to avoid overfitting by building several models on different subsets of the data samples and features, so the combination behaves like a relatively simple model built on top of more complex individual trees, which in many situations gives better results. It is worth mentioning that each tree is trained independently, so the complexity and bias of each tree should be checked as well.
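One way to check the individual trees, sketched here under the assumption that scikit-learn is being used: the out-of-bag score estimates generalization without a separate validation set, and the fitted trees exposed by estimators_ can each be scored on held-out data.

```python
# Assumed setup: scikit-learn. oob_score_ is computed from the samples each
# tree did not see; estimators_ exposes the individual fitted trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

forest = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=1)
forest.fit(X_train, y_train)

print("Out-of-bag score:", round(forest.oob_score_, 3))
print("Ensemble test accuracy:", round(forest.score(X_test, y_test), 3))

# Each tree on its own is usually weaker (and more overfit) than the ensemble.
tree_scores = [tree.score(X_test, y_test) for tree in forest.estimators_]
print("Mean single-tree test accuracy:", round(sum(tree_scores) / len(tree_scores), 3))
```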
Another situation where we can get overfitting is when our data has outliers. Depending on the distribution of the data (for example, a Gaussian distribution versus a Poisson one), bootstrap samples may end up dominated by observations from one side of the distribution, which can bias predictions on the other side. In that case, removing the outliers or reducing the dimensionality of the data helps remove that component of overfitting as well.
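One hedged way to act on this, assuming scikit-learn: flag likely outliers with IsolationForest and drop them before fitting the forest. The 5% contamination rate below is an illustrative assumption, not a recommendation from the article.

```python
# Assumed setup: scikit-learn. IsolationForest labels inliers as 1 and
# outliers as -1; we keep only the inliers before training.
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=2)

inlier_mask = IsolationForest(contamination=0.05, random_state=2).fit_predict(X) == 1
X_clean, y_clean = X[inlier_mask], y[inlier_mask]

forest = RandomForestClassifier(n_estimators=100, random_state=2)
forest.fit(X_clean, y_clean)
print(f"Trained on {X_clean.shape[0]} of {X.shape[0]} samples after outlier removal")
```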
Conclusion
Do not let the model overfit; instead, try to prevent or reduce overfitting during the training process. If there is some kind of anomaly in your data (outliers, strong correlations between features, etc.), it is better to reduce the dataset's dimensionality, remove the problematic features, or check whether a different algorithm fits your data better.
Originally published at https://protonautoml.com on July 16, 2021.