Classification and Regression Decision Trees: The Maths Behind the Decision Tree Classifier

Maths behind Decision Tree Classifier
Before we get to the Python implementation of the decision tree, let’s first understand the math behind decision tree classification. We will see how all the above-mentioned terms are used for splitting.
We will use a simple dataset that contains information about students from different classes, their gender, and whether they stay in the school’s hostel or not.
This is what our dataset looks like:
Let’s try to understand how the root node is selected by calculating Gini impurity. We will use the above-mentioned data.
We have two features that we can use for nodes: “Class” and “Gender”. We will calculate the Gini impurity for each feature and then select the feature with the least Gini impurity.
Let’s review the formula for calculating Gini impurity:
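The formula image from the original post is not reproduced here; the standard definition, which the calculations below assume, is:

Gini = 1 − Σᵢ pᵢ²

where pᵢ is the proportion of samples belonging to class i in the node. The Gini impurity of a split is the weighted average of the Gini impurities of the resulting child nodes.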
Let’s start with “Class”: we will calculate the Gini impurity for each of the different values in “Class”.
This is how our decision tree node is selected, by calculating the Gini impurity for each candidate split individually. If the number of features increases, we just need to repeat the same steps after the selection of the root node.
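Since the worked tables in the original post are images, here is a minimal Python sketch of the same procedure. The helper names (gini, gini_of_split) and the toy records are illustrative assumptions, not the article’s actual dataset.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one node: 1 - sum(p_i^2) over the classes in it."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_of_split(rows, feature, target="hostel"):
    """Weighted Gini impurity of the child nodes after splitting on one feature."""
    n = len(rows)
    weighted = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        weighted += len(subset) / n * gini(subset)
    return weighted

# Hypothetical records for illustration only -- the article's table is an image.
rows = [
    {"class": "IX", "gender": "M", "hostel": "yes"},
    {"class": "IX", "gender": "F", "hostel": "no"},
    {"class": "X",  "gender": "M", "hostel": "yes"},
    {"class": "X",  "gender": "F", "hostel": "yes"},
]

# The feature with the lower weighted Gini impurity becomes the root node.
print(gini_of_split(rows, "class"), gini_of_split(rows, "gender"))
```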
We will now try to find the root node for the same dataset by calculating entropy and information gain.
Dataset:
We have two features, and we will try to choose the root node by calculating the information gain obtained from splitting on each feature.
Let’s review the formulas for entropy and information gain:
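The formula images are likewise not reproduced here; the standard definitions, which the sketch further below follows, are:

Entropy(S) = − Σᵢ pᵢ log₂ pᵢ

Information Gain(S, A) = Entropy(S) − Σᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ)

where pᵢ is the proportion of class i in node S, and Sᵥ is the subset of S for which feature A takes the value v.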
Let’s start with the feature “Class”:
Let’s see the information gain from the feature “Gender”:
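As with the Gini example above, here is a minimal sketch of the entropy and information-gain calculation. The function names and the toy rows are illustrative assumptions, not the article’s actual figures.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of one node: -sum(p_i * log2(p_i)) over the classes in it."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature, target="hostel"):
    """Parent entropy minus the weighted entropy of the child nodes."""
    gain = entropy([r[target] for r in rows])
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# The same hypothetical records as in the Gini sketch above.
rows = [
    {"class": "IX", "gender": "M", "hostel": "yes"},
    {"class": "IX", "gender": "F", "hostel": "no"},
    {"class": "X",  "gender": "M", "hostel": "yes"},
    {"class": "X",  "gender": "F", "hostel": "yes"},
]

# The feature with the higher information gain becomes the root node.
print(information_gain(rows, "class"), information_gain(rows, "gender"))
```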
Different Algorithms for Decision Tree
- ID3 (Iterative Dichotomiser 3): One of the algorithms used to construct a decision tree for classification. It uses information gain as the criterion for finding the root node and splitting it. It only accepts categorical attributes.
- C4.5: An extension of the ID3 algorithm, and an improvement over ID3 as it handles both continuous and discrete values. It is also used for classification purposes.
- Classification and Regression Trees (CART): The most popular algorithm used for constructing decision trees. It uses Gini impurity as the default criterion for selecting root nodes, although “entropy” can be used as the criterion as well. This algorithm works on both regression and classification problems. We will use this algorithm in our Python implementation; a minimal scikit-learn sketch follows this list.
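The sketch below is an assumption about what that implementation might look like with scikit-learn, trained on hypothetical, already label-encoded class/gender/hostel data of the kind described above. DecisionTreeClassifier defaults to criterion="gini" and also accepts criterion="entropy".

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical, already label-encoded data: each row is [class, gender].
X = [[9, 0], [9, 1], [10, 0], [10, 1], [11, 0], [11, 1]]
y = ["yes", "no", "yes", "yes", "no", "no"]   # stays in the hostel?

# CART with the default Gini criterion; swap in criterion="entropy" to compare.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["class", "gender"]))
print(tree.predict([[10, 0]]))                # e.g. a class-10 male student
```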
Entropy and Gini impurity can be used interchangeably; it doesn’t affect the result much. However, Gini is easier to compute than entropy, since entropy involves a logarithm. That’s why the CART algorithm uses Gini as the default criterion.
If we plot Gini vs. entropy, we can see there is not much difference between them:
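The plot in the original post is an image and is not reproduced here. Below is a minimal matplotlib sketch of the same comparison for a two-class node, offered as an assumption about what the original figure showed; the entropy curve scaled by 1/2 is the one usually compared against Gini, and it sits very close to it.

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)            # proportion of the positive class
gini = 2 * p * (1 - p)                        # Gini impurity for two classes
entropy = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

plt.plot(p, gini, label="Gini impurity")
plt.plot(p, entropy, label="Entropy")
plt.plot(p, entropy / 2, "--", label="Entropy scaled by 1/2")
plt.xlabel("p (proportion of one class in the node)")
plt.ylabel("Impurity")
plt.legend()
plt.show()
```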
Advantages of Decision Trees:
- They can be used for both regression and classification problems.
- Decision trees are very easy to grasp, as the splitting rules are clearly stated.
- Even complex decision tree models are simple to interpret when visualized; they can be understood just by looking at the tree.
- Scaling and normalization are not needed.
Disadvantages of Decision Trees:
- A small change in the data can cause instability in the model because of the greedy approach.
- The probability of overfitting is very high for decision trees.
- It takes more time to train a decision tree model than many other classification algorithms.
Translated from: https://medium.com/@er.amansingh2019/maths-behind-decision-tree-classifier-e3bfd5445540