Decision Tree Algorithms and Visualization
Classic decision tree algorithms include ID3, C4.5, and CART (the base learner used in GBDT). ID3 selects features by information gain, C4.5 by information gain ratio, and CART by the Gini index.
Example: the classic weather / play-tennis dataset (example_data.csv), with features such as outlook, temp, and windy and the label play — 14 samples in total, 9 "yes" and 5 "no". (The original table for this example is not recoverable.)
Information gain measures how much the uncertainty of the class Y is reduced once the value of a feature X is known. Let H(D) be the entropy of dataset D and H(D|A) the conditional entropy of D given feature A; the information gain of feature A with respect to D is then:

g(D, A) = H(D) - H(D|A)
The larger the information gain, the more the feature reduces the uncertainty of the dataset, i.e. the stronger its ability to separate the classes. Using the weather example above, the gain of the feature "outlook" can be computed directly:
import numpy as np

# entropy of "play" on sunny days
S_entropy = -(0.6 * np.log2(0.6) + 0.4 * np.log2(0.4))
print('S_entropy:', S_entropy)

# entropy of "play" on overcast days (all "yes")
O_entropy = -(1 * np.log2(1))
print('O_entropy:', O_entropy)

# entropy of "play" on rainy days
R_entropy = -(0.6 * np.log2(0.6) + 0.4 * np.log2(0.4))
print('R_entropy:', R_entropy)

# entropy of the label "play" over all 14 samples
play_entropy = -(5/14 * np.log2(5/14) + 9/14 * np.log2(9/14))
print('play_entropy:', play_entropy)

# conditional entropy of "play" given the feature "outlook"
Play_outlook_Conditional_entropy = 5/14 * S_entropy + 4/14 * O_entropy + 5/14 * R_entropy
print('Play_outlook_Conditional_entropy:', Play_outlook_Conditional_entropy)

# information gain of the feature "outlook"
Gain_play_outlook = play_entropy - Play_outlook_Conditional_entropy
print('Gain_play_outlook:', Gain_play_outlook)
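C4.5's information gain ratio, mentioned at the start of this section, can be obtained from the same quantities by dividing the gain by the split (intrinsic) information of the feature. A minimal sketch continuing the snippet above (the variable names are my own, not from the original post):

# split information of "outlook": it takes the values sunny/overcast/rainy
# on 5/4/5 of the 14 samples
split_info_outlook = -(5/14 * np.log2(5/14) + 4/14 * np.log2(4/14) + 5/14 * np.log2(5/14))

# C4.5 criterion: gain ratio = information gain / split information
Gain_ratio_play_outlook = Gain_play_outlook / split_info_outlook
print('Gain_ratio_play_outlook:', Gain_ratio_play_outlook)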
The process of building the decision tree with the ID3 algorithm:

#coding:utf-8
import numpy as np
import pandas as pd
from math import log


# information entropy of a list of labels
def entropy(ele):
    probs = [ele.count(i) / len(ele) for i in set(ele)]
    entropy = -sum([prob * log(prob, 2) for prob in probs])
    return entropy


# split a dataframe into sub-dataframes keyed by the values of a column
def split_dataframe(data, col):
    unique_values = data[col].unique()
    result_dict = {elem: pd.DataFrame for elem in unique_values}
    for key in result_dict.keys():
        result_dict[key] = data[:][data[col] == key]
    return result_dict


# choose the column with the largest information gain
def choose_best_col(data, label):
    entropy_D = entropy(data[label].tolist())
    cols = [col for col in data.columns if col not in [label]]
    max_value, best_col = -999, None
    max_splited = None
    for col in cols:
        splited_set = split_dataframe(data, col)
        entropy_DA = 0
        for subset_col, subset in splited_set.items():
            entropy_Di = entropy(subset[label].tolist())
            entropy_DA += len(subset) / len(data) * entropy_Di
        info_gain = entropy_D - entropy_DA
        if info_gain > max_value:
            max_value, best_col = info_gain, col
            max_splited = splited_set
    return max_value, best_col, max_splited


class ID3Tree:
    class Node:
        def __init__(self, name):
            self.name = name
            self.connections = {}

        def connect(self, label, node):
            self.connections[label] = node

    def __init__(self, data, label):
        self.columns = data.columns
        self.data = data
        self.label = label
        self.root = self.Node("Root")

    def print_tree(self, node, tabs):
        print(tabs + node.name)
        for connection, child_node in node.connections.items():
            print(tabs + "\t" + "(" + connection + ")")
            self.print_tree(child_node, tabs + "\t\t")

    def construct_tree(self):
        self.construct(self.root, "", self.data, self.columns)

    def construct(self, parent_node, parent_connection_label, input_data, columns):
        max_value, best_col, max_splited = choose_best_col(input_data[columns], self.label)
        # no column left to split on: create a leaf labelled with the class value
        if not best_col:
            node = self.Node(input_data[self.label].iloc[0])
            parent_node.connect(parent_connection_label, node)
            return
        node = self.Node(best_col)
        parent_node.connect(parent_connection_label, node)
        new_columns = [col for col in columns if col != best_col]
        # recurse into each branch of the best column
        for splited_value, splited_data in max_splited.items():
            self.construct(node, splited_value, splited_data, new_columns)


def test_best_gain_info():
    df = pd.read_csv('./example_data.csv', dtype={'windy': 'str'})
    max_value, best_col, max_splited = choose_best_col(df, 'play')
    print('max_value, best_col, max_splited:\n', max_value, best_col, max_splited)


def test_tree_construct():
    df = pd.read_csv('./example_data.csv', dtype={'windy': 'str'})
    id3 = ID3Tree(df, 'play')
    id3.construct_tree()
    id3.print_tree(id3.root, '')


if __name__ == '__main__':
    # test_best_gain_info()
    test_tree_construct()

The process of building a classification tree with CART:
The Gini index is defined for a probability distribution. Suppose a classification problem has K classes and the probability that a sample belongs to class k is p_k; the Gini index of this distribution is:

Gini(p) = Σ_k p_k (1 − p_k) = 1 − Σ_k p_k^2
The corresponding conditional Gini index, i.e. the Gini index of set D given feature A, is computed as:

Gini(D, A) = Σ_i (|D_i| / |D|) · Gini(D_i)

where the D_i are the subsets of D obtained by splitting on the values of A.
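As a quick sanity check, here is a minimal sketch computing the Gini index of the same 9 "yes" / 5 "no" label distribution used in the entropy example above (this snippet is illustrative and not part of the original post):

# Gini index of the label "play" (9 "yes", 5 "no" out of 14 samples)
p_yes, p_no = 9/14, 5/14
gini_play = 1 - (p_yes**2 + p_no**2)
print('gini_play:', gini_play)   # ~0.459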
import numpy as np
import pandas as pd


# Gini index of a list of labels
def gini(nums):
    probs = [nums.count(i) / len(nums) for i in set(nums)]
    gini = sum([p * (1 - p) for p in probs])
    return gini


def split_dataframe(data, col):
    '''
    function: split pandas dataframe to sub-df based on data and column.
    input: dataframe, column name.
    output: a dict of splited dataframe.
    '''
    # unique values of the column
    unique_values = data[col].unique()
    # empty dict of dataframes
    result_dict = {elem: pd.DataFrame for elem in unique_values}
    # split dataframe based on column value
    for key in result_dict.keys():
        result_dict[key] = data[:][data[col] == key]
    return result_dict


def choose_best_col(df, label):
    '''
    function: choose the best column based on the Gini index.
    input: dataframe, label
    output: min conditional Gini index, best column,
            splited dataframe dict based on best column.
    '''
    # calculating label's gini index
    gini_D = gini(df[label].tolist())
    # columns list except label
    cols = [col for col in df.columns if col not in [label]]
    # initialize the min gini index, best column and best splited dict
    min_value, best_col = 999, None
    min_splited = None
    # split data based on different columns
    for col in cols:
        splited_set = split_dataframe(df, col)
        gini_DA = 0
        for subset_col, subset in splited_set.items():
            # calculating splited dataframe label's gini index
            gini_Di = gini(subset[label].tolist())
            # accumulating the conditional gini index of the current feature
            gini_DA += len(subset) / len(df) * gini_Di
        if gini_DA < min_value:
            min_value, best_col = gini_DA, col
            min_splited = splited_set
    return min_value, best_col, min_splited


def test_gini():
    lst = ['a', 'b', 'c', 'd', 'b', 'c', 'a', 'b', 'c', 'd', 'a']
    res = gini(lst)
    print('=res:', res)


def test_csv_gini():
    df = pd.read_csv('./example_data.csv', dtype={'windy': 'str'})
    res = gini(df['play'].tolist())
    print('=res:', res)


def test_split_dataframe():
    df = pd.read_csv('./example_data.csv', dtype={'windy': 'str'})
    res = split_dataframe(df, 'temp')
    print('=res:', res.keys())
    print("=====res['mild']:\n", res['mild'])


def test_choose_best_col():
    df = pd.read_csv('./example_data.csv', dtype={'windy': 'str'})
    min_value, best_col, min_splited = choose_best_col(df, 'play')
    print('==min_value:', min_value)
    print('==best_col:', best_col)
    print('==min_splited:', min_splited)


class CartTree:
    # define a Node class
    class Node:
        def __init__(self, name):
            self.name = name
            self.connections = {}

        def connect(self, label, node):
            self.connections[label] = node

    def __init__(self, data, label):
        self.columns = data.columns
        self.data = data
        self.label = label
        self.root = self.Node("Root")

    # print tree method
    def print_tree(self, node, tabs):
        print(tabs + node.name)
        for connection, child_node in node.connections.items():
            print(tabs + "\t" + "(" + connection + ")")
            self.print_tree(child_node, tabs + "\t\t")

    def construct_tree(self):
        self.construct(self.root, "", self.data, self.columns)

    # construct the tree recursively
    def construct(self, parent_node, parent_connection_label, input_data, columns):
        min_value, best_col, min_splited = choose_best_col(input_data[columns], self.label)
        if not best_col:
            node = self.Node(input_data[self.label].iloc[0])
            parent_node.connect(parent_connection_label, node)
            return
        node = self.Node(best_col)
        parent_node.connect(parent_connection_label, node)
        new_columns = [col for col in columns if col != best_col]
        # recursively construct the subtrees
        for splited_value, splited_data in min_splited.items():
            self.construct(node, splited_value, splited_data, new_columns)


def test_construct_tree():
    df = pd.read_csv('./example_data.csv', dtype={'windy': 'str'})
    tree1 = CartTree(df, 'play')
    tree1.construct_tree()
    tree1.print_tree(tree1.root, "")


if __name__ == '__main__':
    # test_gini()
    # test_csv_gini()
    # test_split_dataframe()
    # test_choose_best_col()
    test_construct_tree()
Install graphviz for visualizing the decision tree:
apt-get install graphviz
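The example below also uses the pydotplus Python package to render the exported dot data; if it is not already available, it can typically be installed with pip (an addition on my part, the original post only mentions graphviz):

pip install pydotplus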
import numpy as np
import matplotlib.pyplot as plt
import pydotplus
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

X = np.array([[2, 2],
              [2, 1],
              [2, 3],
              [1, 2],
              [1, 1],
              [3, 3]])
y = np.array([0, 1, 1, 1, 0, 1])

plt.style.use('fivethirtyeight')
plt.rcParams['font.size'] = 18
plt.figure(figsize=(8, 8))

# plot each point using its label as the marker text
for x1, x2, label in zip(X[:, 0], X[:, 1], y):
    plt.text(x1, x2, str(label), fontsize=40, color='g',
             ha='center', va='center')
plt.grid(None)
plt.xlim((0, 3.5))
plt.ylim((0, 3.5))
plt.xlabel('x1', size=20)
plt.ylabel('x2', size=20)
plt.title('Data', size=24)
# plt.show()

dec_tree = DecisionTreeClassifier()
print(dec_tree)
dec_tree.fit(X, y)
print(dec_tree.score(X, y))

# export the fitted tree as dot data and render it to a png
dot_data = tree.export_graphviz(dec_tree, out_file=None,
                                feature_names=['x1', 'x2'],
                                class_names=['0', '1'],
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
with open('1.png', 'wb') as f:
    f.write(graph.create_png())

In the rendered image, every node except the leaf (terminal) nodes has 5 parts:
- A question about the data based on the value of one feature. The answer is either True or False, and it determines which branch a data point follows down the tree.
- gini: the Gini impurity of the node. The weighted average Gini impurity must decrease as we move down the tree.
- samples: the number of observations in the node.
- value: the number of samples in each class. For example, the top node has 2 samples of class 0 and 4 samples of class 1.
- class: the majority class of the points in the node (ties default to 0). For leaf nodes, this is the prediction for every sample that reaches the node.
The Gini impurity of a node is computed as:

I_G(node) = 1 − Σ_k p_k^2

where p_k is the fraction of samples in the node that belong to class k.
For the root node (2 samples of class 0 and 4 samples of class 1):

I_G(root) = 1 − ((2/6)^2 + (4/6)^2) = 1 − 5/9 ≈ 0.444
On the second level of this tree, the leftmost node has a Gini impurity of 0.5, which might seem to suggest that the impurity increased. However, what must decrease at each level is the weighted average Gini impurity, where every node is weighted by the fraction of its parent's samples it contains. With the split learned above (4 samples in the left node with impurity 0.5, 2 samples in the right node with impurity 0.0), the second level therefore has a weighted Gini impurity of

(4/6) · 0.5 + (2/6) · 0.0 ≈ 0.333,

which is indeed lower than the 0.444 of the root.
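A minimal sketch to check these numbers by hand (the per-node sample counts assume the split described above; they are not printed in the original post):

# Gini impurity of the root node: 2 samples of class 0, 4 samples of class 1
gini_root = 1 - ((2/6)**2 + (4/6)**2)
print('gini_root:', gini_root)        # ~0.444

# weighted Gini impurity of the second level:
# left node: 4 samples with impurity 0.5, right node: 2 samples with impurity 0.0
gini_level2 = (4/6) * 0.5 + (2/6) * 0.0
print('gini_level2:', gini_level2)    # ~0.333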
On the final level the Gini impurity of every node reaches 0.0, meaning each node contains samples of only a single class. This is expected, because we did not limit the depth of the tree and allowed it to create as many levels as needed to classify every data point. Although the model classifies all training points correctly, that does not mean it is perfect: it has most likely overfit the training data.
Decision Tree Example 2: the iris dataset
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np


def plot_feature_importances(clf, feature_names):
    """Visualize the feature importances of a fitted classifier."""
    c_features = len(feature_names)
    plt.barh(range(c_features), clf.feature_importances_)
    plt.xlabel('Feature importance')
    plt.ylabel('Feature name')
    plt.yticks(np.arange(c_features), feature_names)
    plt.show()


iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

# compare trees of different maximum depths
max_depth_values = [2, 3, 4]
for max_depth_val in max_depth_values:
    dt_model = DecisionTreeClassifier(max_depth=max_depth_val)
    dt_model.fit(X_train, y_train)
    print('max_depth=', max_depth_val)
    print('Accuracy on the training set: {:.3f}'.format(dt_model.score(X_train, y_train)))
    print('Accuracy on the test set: {:.3f}'.format(dt_model.score(X_test, y_test)))
    print()

# refit with max_depth=4 and inspect feature importances
dt_model = DecisionTreeClassifier(max_depth=4)
dt_model.fit(X_train, y_train)
print(iris.feature_names)
print(dt_model.feature_importances_)
plot_feature_importances(dt_model, iris.feature_names)
Random Forests
A random forest is a model built from many decision trees. It is not merely a forest but a random one, which involves two ideas:
1. Random sampling of the training data points
2. Splitting nodes on random subsets of the features
Random sampling
A key idea behind random forests is that each tree is trained on a random sample of the data points. These samples are drawn with replacement (known as bootstrapping), which means some samples are used multiple times when training a single tree (this behaviour can be turned off if desired). The intuition is that by training each tree on a different sample, each individual tree may have high variance with respect to its particular subset of the training data, but the forest as a whole has much lower variance. This procedure of training each learner on a different subset of the data and then averaging the predictions is called bagging, short for bootstrap aggregating.
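A minimal sketch of bootstrap sampling with NumPy (illustrative only; scikit-learn's random forest does this internally):

import numpy as np

rng = np.random.default_rng(0)
n_samples = 14

# draw a bootstrap sample: indices are sampled with replacement,
# so some rows appear several times and others not at all
boot_idx = rng.choice(n_samples, size=n_samples, replace=True)
print('bootstrap indices:', boot_idx)
print('unique rows used: ', len(np.unique(boot_idx)), 'of', n_samples)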
Random subsets of features
The other idea behind random forests is that, when splitting each node of a decision tree, only a subset of all features is considered. This is usually set to sqrt(n_features), meaning that at each node the tree considers splitting on a number of features equal to the square root of the total number of features. A random forest can also be trained considering every feature at every node (both options are configurable in the scikit-learn implementation). A random forest combines hundreds or thousands of decision trees, trains each one on a slightly different set of observations (sampled with replacement), and splits the nodes of each tree using a limited number of features. The final prediction of the random forest is the average of the predictions of the individual trees.
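As a minimal sketch, the scikit-learn implementation exposes these two sources of randomness through the bootstrap and max_features parameters (reusing the iris split from the example above; the parameter values shown here are only illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)

# 100 trees, each trained on a bootstrap sample and
# considering sqrt(n_features) candidate features at every split
rf_model = RandomForestClassifier(n_estimators=100,
                                  bootstrap=True,
                                  max_features='sqrt',
                                  random_state=0)
rf_model.fit(X_train, y_train)
print('Accuracy on the test set: {:.3f}'.format(rf_model.score(X_test, y_test)))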
Summary

ID3, C4.5 and CART grow a tree by repeatedly choosing the feature that scores best under information gain, gain ratio or the Gini index respectively. A single tree is easy to visualize with graphviz but tends to overfit when its depth is not limited; random forests reduce this variance by bagging many trees trained on bootstrap samples and random subsets of features, then averaging their predictions.