Random Forest Explained
In this post, we will explain what a Random Forest model is, what its strengths are, how it is built, and what it can be used for.
We will go through the theory and intuition of Random Forest, seeing the minimum amount of maths necessary to understand how everything works, without diving into the most complex details.
Let’s get to it!
1. Introduction
In the Machine Learning world, Random Forest models are a kind of non-parametric model that can be used both for regression and classification. They are one of the most popular ensemble methods, belonging to the specific category of Bagging methods.
Ensemble methods involve using many learners to enhance the performance of any single one of them individually. These methods can be described as techniques that use a group of weak learners (those who on average achieve only slightly better results than a random model) together, in order to create a stronger, aggregated one.
In our case, Random Forests are an ensemble of many individual Decision Trees. If you are not familiar with Decision Trees, you can learn all about them here:
One of the main drawbacks of Decision Trees is that they are very prone to over-fitting: they do well on training data, but are not so flexible for making predictions on unseen samples. While there are workarounds for this, like pruning the trees, this reduces their predictive power. Generally they are models with medium bias and high variance, but they are simple and easy to interpret.
If you are not very confident with the difference between bias and variance, check out the following post:
Random Forest models combine the simplicity of Decision Trees with the flexibility and power of an ensemble model. In a forest of trees, we forget about the high variance of a specific tree, and are less concerned about each individual element, so we can grow nicer, larger trees that have more predictive power than a pruned one.
Although Random Forest models don’t offer as much interpretability as a single tree, their performance is a lot better, and we don’t have to worry so much about perfectly tuning the parameters of the forest as we do with individual trees.
Okay, I get it, a Random Forest is a collection of individual trees. But why the name Random? Where is the randomness? Let’s find out by learning how a Random Forest model is built.
2. Training and Building a Random Forest
Building a Random Forest has 3 main phases. We will break down each of them and clarify each of the concepts and steps. Let’s go!
2.1 Creating a Bootstrapped Data Set for each tree
When we build an individual decision tree, we use a training data set and all of the observations. This means that if we are not careful, the tree can adjust very well to this training data, and generalise badly to new, unseen observations. To solve this issue, we stop the tree from growing very large, usually at the cost of reducing its performance.
To build a Random Forest we have to train N decision trees. Do we train the trees using the same data all the time? Do we use the whole data set? Nope.
This is where the first random feature comes in. To train each individual tree, we pick a random sample of the entire Data set, like shown in the following figure.
Icons by Flaticon. From looking at this figure, various things can be deduced. First of all, the size of the data used to train each individual tree does not have to be the size of the whole data set. Also, a data point can be present more than once in the data used to train a single tree (as in tree number two).
This is called Sampling with Replacement or Bootstrapping: each data point is picked randomly from the whole data set, and a data point can be picked more than once.
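Sampling with replacement can be sketched in a few lines. The toy data, the `bootstrap_sample` helper, and the seed below are illustrative, not from the original article:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data set: 6 observations, 2 features each (illustrative values).
X = np.array([[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [6, 60]])
y = np.array([0, 1, 0, 1, 0, 1])

def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement: some rows repeat, some are left out."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

X_boot, y_boot = bootstrap_sample(X, y, rng)
print(X_boot.shape)  # same size as the original data set: (6, 2)
```

Each tree in the forest would get its own call to `bootstrap_sample`, so no two trees (in general) see exactly the same data.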
By using different samples of data to train each individual tree we reduce one of the main problems that they have: they are very fond of their training data. If we train a forest with a lot of trees and each of them has been trained with different data, we solve this problem. They are all very fond of their training data, but the forest is not fond of any specific data point. This allows us to grow larger individual trees, as we do not care so much anymore for an individual tree overfitting.
If we use a very small portion of the whole data set to train each individual tree, we increase the randomness of the forest (reducing over-fitting) but usually at the cost of a lower performance.
In practice, by default most Random Forest implementations (like the one from Scikit-Learn) pick the sample of the training data used for each tree to be the same size as the original data set (remember, though, that it is not the same data set, as we are picking random samples).
This generally provides a good bias-variance compromise.
2.2 Train a forest of trees using these random data sets, and add a little more randomness with the feature selection
If you remember well, for building an individual decision tree, at each node we evaluated a certain metric (like the Gini index, or Information Gain) and picked the feature or variable of the data to go in the node that minimised/maximised this metric.
This worked decently well when training only one tree, but now we want a whole forest of them! How do we do it? Ensemble models like Random Forest work best if the individual models (individual trees in our case) are uncorrelated. In Random Forest this is achieved by randomly selecting certain features to evaluate at each node.
Icons by Flaticon. As you can see from the previous image, at each node we evaluate only a subset of all the initial features. For the root node we take into account E, A and F (and F wins). In Node 1 we consider C, G and D (and G wins). Lastly, in Node 2 we consider only A, B and G (and A wins). We would carry on doing this until we built the whole tree.
By doing this, we avoid including features that have a very high predictive power in every tree, while creating many uncorrelated trees. This is the second source of randomness. We do not only use random data, but also random features when building each tree. The greater the tree diversity, the better: we reduce the variance, and get a better performing model.
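The per-node feature subsampling can be sketched as follows. The Gini helper and the median-threshold split below are deliberate simplifications of a real tree-building routine, chosen just to make the idea concrete; only ~sqrt(d) randomly chosen features compete at the node:

```python
import numpy as np

rng = np.random.default_rng(0)

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_feature_at_node(X, y, rng):
    """Evaluate only a random subset of ~sqrt(d) features, as Random Forest does."""
    d = X.shape[1]
    candidates = rng.choice(d, size=max(1, int(np.sqrt(d))), replace=False)
    best, best_score = None, np.inf
    for f in candidates:
        thr = np.median(X[:, f])  # simplification: split at the median
        left, right = y[X[:, f] <= thr], y[X[:, f] > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        # Weighted Gini impurity of the split on this candidate feature.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best, best_score = f, score
    return best

X = rng.normal(size=(100, 9))
y = (X[:, 3] > 0).astype(int)  # in this toy data only feature 3 is informative
print(best_feature_at_node(X, y, rng))
```

Note that the informative feature may not even be among the candidates at a given node, which is exactly what decorrelates the trees.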
2.3 Repeat this for the N trees to create our awesome forest.
Awesome, we have learned how to build a single decision tree. Now, we would repeat this for the N trees, randomly selecting, at each node of each tree, which variables enter the contest to be picked as the feature to split on.
In conclusion, the whole process goes as follows:
Once we have built our forest, we are ready to use it to make awesome predictions. Let’s see how!
3. Making predictions using a Random Forest
Making predictions with a Random Forest is very easy. We just have to take each of our individual trees, pass the observation for which we want to make a prediction through them, get a prediction from every tree (summing up to N predictions) and then obtain an overall, aggregated prediction.
Bootstrapping the data and then using an aggregate to make a prediction is called Bagging, and how this prediction is made depends on the kind of problem we are facing.
For regression problems, the aggregate decision is the average of the decisions of every single decision tree. For classification problems, the final prediction is the most frequent prediction done by the forest.
對于回歸問題,總決策是每個決策樹的決策的平均值。 對于分類問題,最終預測是森林所做的最頻繁的預測。
Icons by Flaticon. The previous image illustrates this very simple procedure. For the classification problem we want to predict whether a certain patient is sick or healthy. For this we pass their medical record and other information through each tree of the random forest, and obtain N predictions (400 in our case). In our example 355 of the trees say that the patient is healthy and 45 say that the patient is sick, therefore the forest decides that the patient is healthy.
For the regression problem we want to predict the price of a certain house. We pass the characteristics of this new house through our N trees, getting a numerical prediction from each of them. Then, we calculate the average of these predictions and get the final value of $322,750.
Simple right? We make a prediction with every individual tree and then aggregate these predictions using the mean (average) or the mode (most frequent value).
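The aggregation step itself is just a mean or a mode over the N per-tree predictions. A minimal sketch using the counts from the examples above (the regression prices are illustrative values chosen to average to $322,750):

```python
import numpy as np

# Classification: 400 trees vote "healthy" (0) or "sick" (1).
votes = np.array([0] * 355 + [1] * 45)
majority = np.bincount(votes).argmax()  # mode: the most frequent prediction
print("healthy" if majority == 0 else "sick")  # -> healthy

# Regression: each tree predicts a house price; the forest averages them.
tree_prices = np.array([310_000.0, 325_000.0, 333_250.0])  # illustrative values
print(tree_prices.mean())  # -> 322750.0
```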
4. Conclusion and other resources
In this post we have seen what a Random Forest is, how it overcomes the main issues of Decision Trees, how they are trained, and used to make predictions. They are very flexible and powerful Machine Learning models that are highly used in commercial and industrial applications, along with Boosting models and Artificial Neural Networks.
In future posts we will explore tips and tricks of Random Forests and how they can be used for feature selection. Also, if you want to see precisely how they are built, check out the following video by StatQuest, it’s great:
[Embedded video: StatQuest — Random Forests]
That is it! As always, I hope you enjoyed the post. If you did, feel free to follow me on Twitter at @jaimezorno. Also, you can take a look at my other posts on Data Science and Machine Learning here, and subscribe to my newsletter to get awesome goodies and notifications on new posts!
For further resources on Machine Learning and Data Science check out the following repository: How to Learn Machine Learning!
Translated from: https://towardsdatascience.com/random-forest-explained-7eae084f3ebe