[Python Machine Learning] ID3 Decision Tree with Result Visualization and Full Source Code: Classifying the UCI Caesarian Section Dataset
Decision Tree
- Libraries used
- Implementation
  - Computing the empirical entropy
    - Empirical entropy formula
  - Conditional entropy
  - Information gain
  - ID3
    - Choosing the attribute with the largest information gain
    - Procedure
  - Fitting
  - Prediction
  - Evaluation
  - Decision tree visualization
  - Saving the decision tree
  - Loading the decision tree
- Rendered output
- Full source code
- How to get each intermediate result
- Experimental results (decision tree)
  - Debug mode
A decision tree is a decision-analysis method that, given the known probabilities of various outcomes, builds a tree to estimate the probability that the expected net present value is non-negative, thereby evaluating project risk and judging feasibility; it is a graphical technique that applies probability analysis intuitively. Because the decision branches, when drawn out, resemble the limbs of a tree, it is called a decision tree. Source: Baidu Baike, 決策樹 (Decision Tree)
The dataset is the UCI Caesarian Section Classification Dataset.
[Details about the dataset and its download address]
- This code implements the ID3 decision tree algorithm and uses it for prediction.
- The algorithm is written as a class, enabling code reuse and keeping usage simple.
- Setting the logging level to DEBUG prints the detailed result of every step of tree construction.
- The decision tree is visualized using mermaid's text-based diagram format.
Libraries Used
- Python 3
- Pandas
- sklearn (only used to split the dataset)
- numpy
Implementation
Computing the Empirical Entropy
When the probabilities in the entropy formula are estimated from data (in particular, by maximum likelihood estimation), the resulting entropy is called the empirical entropy.
Empirical Entropy Formula
$$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$
```python
def empirical_entropy(self, dataset=None):
    """Compute the empirical entropy
    $$H = -\sum^n_{i=1}p(x_i)log_2(p(x_i))$$
    :return: float, the empirical entropy
    """
    if dataset is None:
        dataset = self.DataSet
    columns_count = dataset.iloc[:, -1].value_counts()
    entropy = 0
    total_count = columns_count.sum()
    for count in columns_count:
        p = count / total_count
        entropy -= p * np.log2(p)
    return entropy
```

Conditional Entropy
The conditional entropy $H(Y|X)$ measures the uncertainty of random variable Y given that random variable X is known.
It is defined as the expectation over X of the entropy of the conditional distribution of Y given X:
$$H(Y|X) = \sum_{i=1}^{n} p_i\, H(Y|X=x_i)$$
where $p_i = P(X = x_i)$.
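To make the definition concrete, here is a minimal standalone sketch; the helper names `entropy` and `conditional_entropy` are illustrative, not methods of the article's class:

```python
import numpy as np
import pandas as pd

def entropy(series: pd.Series) -> float:
    """Empirical entropy of a label column."""
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(df: pd.DataFrame, feature: str, target: str) -> float:
    """H(Y|X): entropy of `target` within each value of `feature`,
    weighted by that value's empirical probability p_i."""
    total = len(df)
    h = 0.0
    for value, group in df.groupby(feature):
        h += len(group) / total * entropy(group[target])
    return h

# Toy data: the feature perfectly separates the labels, so H(Y|X) = 0.
df = pd.DataFrame({"x": [0, 0, 1, 1], "y": ["a", "a", "b", "b"]})
print(conditional_entropy(df, "x", "y"))  # → 0.0
```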
Information Gain
Information gain measures how much knowing the value of a feature X reduces the uncertainty about the class Y.
In other words: how much the feature helps with classification.
When the classification problem is hard — that is, when the empirical entropy of the training set is large — information gain values tend to be large, and vice versa.
The information gain ratio corrects for this bias and serves as an alternative feature-selection criterion.
The information gain g(D, A) of feature A on training set D is defined as the difference between the empirical entropy H(D) of D and the empirical conditional entropy H(D|A) of D given A:
$$g(D, A) = H(D) - H(D|A)$$
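As a worked example using the numbers from the demo run in the debug log below (14 samples, 9 positive and 5 negative; the attribute 年齡 splits them into subsets of 5, 5, and 4 rows):

$$H(D) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.9403$$

$$H(D \mid \text{年齡}) = \tfrac{5}{14}\cdot 0.9710 + \tfrac{5}{14}\cdot 0.9710 + \tfrac{4}{14}\cdot 0 \approx 0.6935$$

$$g(D, \text{年齡}) = 0.9403 - 0.6935 \approx 0.2467$$

which matches the `informationGain` values printed in the log.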
ID3
In short, ID3 repeatedly selects the attribute that helps classification the most, then, for each value of that attribute, recursively selects the next best attribute.
Choosing the Attribute with the Largest Information Gain
A smaller conditional empirical entropy (meaning the resulting partitions are purer, i.e. a larger information gain) indicates that the attribute matters more for classification.
Here, extract_dataset produces the subset of the dataset matching a given condition; it is used to compute the conditional empirical entropy and, from that, the information gain.
```python
def extract_dataset(self, dataset: pd.DataFrame, column, label):
    """Filter the dataset by column and label.
    :return: pd.DataFrame, the filtered dataset
    """
    if type(column) == int:
        split_dataset = dataset[dataset.iloc[:, column] == label].drop(dataset.columns[column], axis=1)
    else:
        split_dataset = dataset[dataset.loc[:, column] == label].drop(column, axis=1)
    return split_dataset

def best_empirical_entropy(self, dataset: pd.DataFrame = None):
    """Pick the best of the dataset's columns (largest information gain).
    :param dataset: the dataset to select from
    :return: the chosen column
    """
    if dataset is None:
        dataset = self.DataSet
    columns = dataset.columns[:-1]
    total_count = dataset.shape[0]
    empirical_entropy = self.empirical_entropy(dataset)
    logging.debug(f"now dataset shape is {dataset.shape}, column is {dataset.columns.tolist()}")
    logging.debug(f"empirical_entropy is {empirical_entropy}")
    informationGain_max = -1
    best_column = None
    for column in columns:
        entropy_tmp = 0
        data_counts = dataset.loc[:, column].value_counts()
        data_labels = data_counts.index
        logging.debug(f"now is {column}")
        for label in data_labels:
            split_dataset = self.extract_dataset(dataset, column, label)
            count = split_dataset.shape[0]
            p = count / total_count
            entropy_tmp += p * self.empirical_entropy(split_dataset)
            logging.debug(f"now label is {label}, chooseData shape is {split_dataset.shape}, "
                          f"Ans count: {split_dataset.iloc[:, -1].value_counts().tolist()}, "
                          f"entropy: {self.empirical_entropy(split_dataset)}")
        informationGain = empirical_entropy - entropy_tmp
        logging.debug(f"entropy: {entropy_tmp}, {column} informationGain:{informationGain}")
        if informationGain > informationGain_max:
            best_column = column
            informationGain_max = informationGain
    logging.debug(f"Choose {best_column}:{informationGain_max}")
    return best_column
```

Procedure
Why can a node end up with nothing left to select? Because rows that agree on every remaining attribute may still have different outcomes, a node can remain impure even after all attributes have been used up.
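The article's `id3` routine itself is part of the full source; the selection-and-recursion loop it performs can be sketched roughly as follows. This is a simplified, hypothetical version: leaves are plain values rather than the article's node dicts, the information-gain threshold is ignored, and conflicting rows are resolved by majority vote.

```python
import numpy as np
import pandas as pd

def _entropy(labels: pd.Series) -> float:
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def id3(dataset: pd.DataFrame):
    """Last column = class label. Returns a nested dict
    {"column": name, "next": {value: subtree_or_leaf, ...}} or a leaf value."""
    labels = dataset.iloc[:, -1]
    # Stop: pure node, or no attributes left (labels may still be mixed).
    if labels.nunique() == 1 or dataset.shape[1] == 1:
        return labels.mode()[0]  # majority vote resolves conflicting rows
    base = _entropy(labels)
    gains = {}
    for col in dataset.columns[:-1]:
        cond = sum(len(g) / len(dataset) * _entropy(g.iloc[:, -1])
                   for _, g in dataset.groupby(col))
        gains[col] = base - cond
    best = max(gains, key=gains.get)  # attribute with the largest information gain
    node = {"column": best, "next": {}}
    for value, group in dataset.groupby(best):
        node["next"][value] = id3(group.drop(columns=best))
    return node

demo = pd.DataFrame({"work": ["y", "y", "n", "n"], "cls": ["yes", "yes", "no", "no"]})
print(id3(demo))  # → {'column': 'work', 'next': {'n': 'no', 'y': 'yes'}}
```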
Fitting

```python
def fit(self, x: pd.DataFrame, y=None, algorithm: str = "id3", threshold=0.1):
    """Fit on a dataset. If y is not supplied, the last column of x must
    contain the classification result.
    :param x: pd.DataFrame, the attributes (the full dataset, including the
        result column, when y is None)
    :param y: list-like, shape=(-1,), the classification results
    :param algorithm: which algorithm to use (currently only ID3)
    :param threshold: the information-gain threshold
    :return: the root node of the decision tree
    """
    self.check_dataset(x, dimension=2)
    self.check_dataset(y, dimension=1)
    self._threshold = threshold
    dataset = x
    if y is not None:
        dataset.insert(dataset.shape[1], 'DECISION_tempADD', y)
    # getattr(self, algorithm) would be a safer alternative to eval here
    self.decision_tree = eval("self." + algorithm)(dataset)
    logging.info(f"decision_tree leaf:{self._leafCount}")
    return self.decision_tree
```

Prediction
```python
def predict(self, x: pd.DataFrame):
    """Predict on a dataset.
    :param x: pd.DataFrame, the input dataset
    :return: the classification results
    """
    self.y_predict = x.apply(self._predict_line, axis=1)
    return self.y_predict

def _predict_line(self, line):
    """Private helper used by predict to classify a single row.
    :param line: one row of the input dataset
    :return: the classification result for that row
    """
    tree = self.decision_tree
    while True:
        try:
            if len(tree["next"]) == 1:
                return tree["next"]["其他"]
            else:
                value = line[tree["column"]]
                tree = tree["next"][value]
        except KeyError:
            # unseen value: fall back to the "其他" ("other") branch
            return tree["next"]["其他"]
```

Evaluation
Evaluate the accuracy, precision, and recall of the results.
- The score function only supports binary classification; it does not apply to multi-class problems (though the decision tree itself can still predict them).
- score also assumes positive examples are labeled 1 and negative examples 0.
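The metrics under those constraints can be sketched as below; this is a hypothetical standalone version, not the article's score method:

```python
def score(y_true, y_pred):
    """Binary-only evaluation (positive class = 1, negative = 0).
    Returns (accuracy, precision, recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

print(score([1, 1, 0, 0], [1, 0, 1, 0]))  # → (0.5, 0.5, 0.5)
```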
Decision Tree Visualization
The tree is drawn with mermaid's text-based diagram format. Predicted values are merged: when different values of the same attribute lead to the same classification, they all point to a single output node in the visualization.
- The visualization function offers two output formats:
  - markdown
  - html (recommended — open it in a browser to view the tree)
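The merging idea can be sketched as a small emitter over a simplified tree structure (leaves as plain values; the function name and node shapes are illustrative, not the article's implementation):

```python
def to_mermaid(tree: dict) -> str:
    """Emit mermaid flowchart text for a nested tree of the form
    {"column": name, "next": {value: subtree_or_leaf, ...}}.
    Edges that reach the same leaf result share one output node."""
    lines = ["graph TD"]
    counter = [0]

    def new_id() -> str:
        counter[0] += 1
        return f"n{counter[0]}"

    def walk(node: dict, nid: str) -> None:
        lines.append(f'{nid}["{node["column"]}"]')
        leaf_ids = {}  # one shared output node per distinct leaf result
        for value, child in node["next"].items():
            if isinstance(child, dict):
                cid = new_id()
                lines.append(f"{nid} -->|{value}| {cid}")
                walk(child, cid)
            else:
                if child not in leaf_ids:
                    leaf_ids[child] = new_id()
                    lines.append(f'{leaf_ids[child]}(["{child}"])')
                lines.append(f"{nid} -->|{value}| {leaf_ids[child]}")

    walk(tree, "n0")
    return "\n".join(lines)

tree = {"column": "age", "next": {0: "yes", 1: "yes", 2: "no"}}
print(to_mermaid(tree))
```

Here the values 0 and 1 both classify as "yes", so their edges point at the same output node.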
Saving the Decision Tree

```python
def save(self, savePath: str):
    # use self.decision_tree (the original referenced a global instance)
    with open(savePath, "w") as f:
        f.write(str(self.decision_tree))
    logging.info(f"Decision tree saved to: {savePath}")
```

Loading the Decision Tree
```python
def load(self, savePath: str):
    with open(savePath, "r") as f:
        # ast.literal_eval would be safer than eval for untrusted files
        tree = eval(f.read())
    if type(tree) == dict:
        self.decision_tree = tree
    else:
        raise Exception("Load failed!")
```

Rendered Output
A sample diagram, not the classification result for this dataset.
Full Source Code
How to Get Each Intermediate Result
If you don't want that much detail, set `level` in the `logging.basicConfig` call at the top of the file to INFO.
That is, change:

```python
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
```

to:

```python
logging.basicConfig(level=logging.INFO, format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
```
To write the log to a file:
the `filename` parameter sets the log file path, and `filemode` sets the file write mode.

```python
logging.basicConfig(level=logging.DEBUG, filename='DecisionTree.log', filemode='w', format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
```
Possible issues when running the code
- Wrong dataset format: the Caesarian Section Classification Dataset downloads as an arff file, while this code expects csv. Extract the data section from the arff file — for example, open it in a text editor and save the data portion as a csv file.
- The code also ships with a demo that runs without any external dataset.
- The score function only supports binary classification (the tree can still predict multi-class data), and it assumes positives are labeled 1 and negatives 0.
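The arff-to-csv step can also be scripted. A minimal sketch, assuming a simple ARFF layout (header lines, then an `@data` marker followed by comma-separated rows); the function name and the `header` argument are illustrative:

```python
def arff_data_to_csv(arff_path: str, csv_path: str, header: str) -> None:
    """Copy the data section of an ARFF file into a CSV file,
    prepending the given CSV header line."""
    with open(arff_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    # everything after the "@data" marker is the comma-separated data block
    start = next(i for i, l in enumerate(lines) if l.strip().lower() == "@data")
    # skip blank lines and % comments
    rows = [l.strip() for l in lines[start + 1:]
            if l.strip() and not l.lstrip().startswith("%")]
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write(header + "\n" + "\n".join(rows) + "\n")
```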
Experimental Results (Decision Tree)
Debug Mode
Run on the demo dataset:
```
2020-10-14 00:47:19,827-[root] [DEBUG] [best_empirical_entropy]: now dataset shape is (14, 5), column is ['年齡', '有工作', '是學生', '信貸情況', 'DECISION_tempADD']
2020-10-14 00:47:19,827-[root] [DEBUG] [best_empirical_entropy]: empirical_entropy is 0.9402859586706311
2020-10-14 00:47:19,831-[root] [DEBUG] [best_empirical_entropy]: now is 年齡
2020-10-14 00:47:19,849-[root] [DEBUG] [best_empirical_entropy]: now label is 2, chooseData shape is (5, 4), Ans count: [3, 2], entropy: 0.9709505944546686
2020-10-14 00:47:19,859-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (5, 4), Ans count: [3, 2], entropy: 0.9709505944546686
2020-10-14 00:47:19,865-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (4, 4), Ans count: [4], entropy: 0.0
2020-10-14 00:47:19,865-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.6935361388961918, 年齡 informationGain:0.24674981977443933
2020-10-14 00:47:19,868-[root] [DEBUG] [best_empirical_entropy]: now is 有工作
2020-10-14 00:47:19,880-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (6, 4), Ans count: [4, 2], entropy: 0.9182958340544896
2020-10-14 00:47:19,889-[root] [DEBUG] [best_empirical_entropy]: now label is 2, chooseData shape is (4, 4), Ans count: [2, 2], entropy: 1.0
2020-10-14 00:47:19,896-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (4, 4), Ans count: [3, 1], entropy: 0.8112781244591328
2020-10-14 00:47:19,897-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9110633930116763, 有工作 informationGain:0.02922256565895487
2020-10-14 00:47:19,898-[root] [DEBUG] [best_empirical_entropy]: now is 是學生
2020-10-14 00:47:19,909-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (7, 4), Ans count: [6, 1], entropy: 0.5916727785823275
2020-10-14 00:47:19,917-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (7, 4), Ans count: [4, 3], entropy: 0.9852281360342515
2020-10-14 00:47:19,918-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.7884504573082896, 是學生 informationGain:0.15183550136234159
2020-10-14 00:47:19,920-[root] [DEBUG] [best_empirical_entropy]: now is 信貸情況
2020-10-14 00:47:19,927-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (8, 4), Ans count: [6, 2], entropy: 0.8112781244591328
2020-10-14 00:47:19,937-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (6, 4), Ans count: [3, 3], entropy: 1.0
2020-10-14 00:47:19,937-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.8921589282623617, 信貸情況 informationGain:0.04812703040826949
2020-10-14 00:47:19,937-[root] [DEBUG] [best_empirical_entropy]: Choose 年齡:0.24674981977443933
2020-10-14 00:47:19,940-[root] [DEBUG] [id3]: now choose_column:年齡, label: 2
2020-10-14 00:47:19,950-[root] [DEBUG] [best_empirical_entropy]: now dataset shape is (5, 4), column is ['有工作', '是學生', '信貸情況', 'DECISION_tempADD']
2020-10-14 00:47:19,950-[root] [DEBUG] [best_empirical_entropy]: empirical_entropy is 0.9709505944546686
2020-10-14 00:47:19,953-[root] [DEBUG] [best_empirical_entropy]: now is 有工作
2020-10-14 00:47:19,964-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:19,974-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:19,974-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9509775004326937, 有工作 informationGain:0.01997309402197489
2020-10-14 00:47:19,976-[root] [DEBUG] [best_empirical_entropy]: now is 是學生
2020-10-14 00:47:19,983-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:19,992-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:19,992-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9509775004326937, 是學生 informationGain:0.01997309402197489
2020-10-14 00:47:19,995-[root] [DEBUG] [best_empirical_entropy]: now is 信貸情況
2020-10-14 00:47:20,004-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [3], entropy: 0.0
2020-10-14 00:47:20,013-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,013-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.0, 信貸情況 informationGain:0.9709505944546686
2020-10-14 00:47:20,013-[root] [DEBUG] [best_empirical_entropy]: Choose 信貸情況:0.9709505944546686
2020-10-14 00:47:20,015-[root] [DEBUG] [id3]: now choose_column:信貸情況, label: 0
2020-10-14 00:47:20,021-[root] [DEBUG] [id3]: select decision 1, result_type:[3], dataset column:(3, 3), lower than threshold:False
2020-10-14 00:47:20,021-[root] [DEBUG] [id3]: now choose_column:信貸情況, label: 1
2020-10-14 00:47:20,027-[root] [DEBUG] [id3]: select decision 0, result_type:[2], dataset column:(2, 3), lower than threshold:False
2020-10-14 00:47:20,028-[root] [DEBUG] [id3]: now choose_column:年齡, label: 0
2020-10-14 00:47:20,037-[root] [DEBUG] [best_empirical_entropy]: now dataset shape is (5, 4), column is ['有工作', '是學生', '信貸情況', 'DECISION_tempADD']
2020-10-14 00:47:20,037-[root] [DEBUG] [best_empirical_entropy]: empirical_entropy is 0.9709505944546686
2020-10-14 00:47:20,038-[root] [DEBUG] [best_empirical_entropy]: now is 有工作
2020-10-14 00:47:20,046-[root] [DEBUG] [best_empirical_entropy]: now label is 2, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,052-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:20,060-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (1, 3), Ans count: [1], entropy: 0.0
2020-10-14 00:47:20,060-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.4, 有工作 informationGain:0.5709505944546686
2020-10-14 00:47:20,061-[root] [DEBUG] [best_empirical_entropy]: now is 是學生
2020-10-14 00:47:20,068-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [3], entropy: 0.0
2020-10-14 00:47:20,076-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,076-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.0, 是學生 informationGain:0.9709505944546686
2020-10-14 00:47:20,077-[root] [DEBUG] [best_empirical_entropy]: now is 信貸情況
2020-10-14 00:47:20,085-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:20,092-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:20,092-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9509775004326937, 信貸情況 informationGain:0.01997309402197489
2020-10-14 00:47:20,092-[root] [DEBUG] [best_empirical_entropy]: Choose 是學生:0.9709505944546686
2020-10-14 00:47:20,094-[root] [DEBUG] [id3]: now choose_column:是學生, label: 0
2020-10-14 00:47:20,100-[root] [DEBUG] [id3]: select decision 0, result_type:[3], dataset column:(3, 3), lower than threshold:False
2020-10-14 00:47:20,100-[root] [DEBUG] [id3]: now choose_column:是學生, label: 1
2020-10-14 00:47:20,106-[root] [DEBUG] [id3]: select decision 1, result_type:[2], dataset column:(2, 3), lower than threshold:False
2020-10-14 00:47:20,106-[root] [DEBUG] [id3]: now choose_column:年齡, label: 1
2020-10-14 00:47:20,112-[root] [DEBUG] [id3]: select decision 1, result_type:[4], dataset column:(4, 4), lower than threshold:False
2020-10-14 00:47:20,112-[root] [INFO] [fit]: decision_tree leaf:5
2020-10-14 00:47:20,113-[root] [INFO] [save]: 決策樹已保存,位置:decisionTree.txt
2020-10-14 00:47:20,123-[root] [DEBUG] [score]: y_acutalTrue:9, y_acutalFalse:5, y_predictTrue:9, y_true:9, y_total:14
```

Summary