[Python Machine Learning] ID3 Decision Tree with Result Visualization and Full Source Code: Classifying the UCI Caesarian Section Dataset
Decision Tree
- Libraries used
- Implementation
  - Computing the empirical entropy
    - Empirical entropy formula
  - Conditional entropy
  - Information gain
  - ID3
    - Choosing the attribute with the largest information gain
    - Procedure
  - Fitting
  - Prediction
  - Evaluation
  - Decision tree visualization
  - Saving the decision tree
  - Loading the decision tree
- Rendered output
- Full source code
- How to get each intermediate result
- Experimental results (decision tree)
  - Debug mode
A decision tree is a decision-analysis method that, given the known probabilities of various outcomes, builds a tree to estimate the probability that the expected net present value is non-negative, thereby evaluating project risk and judging feasibility; it is a graphical technique that applies probability analysis intuitively. Because the decision branches, when drawn out, resemble the limbs of a tree, it is called a decision tree. Source: Baidu Baike, 決策樹 (Decision Tree)
The dataset is the UCI Caesarian Section Classification Dataset.
[Details about the dataset and its download address]
- This code implements the ID3 decision tree algorithm and uses it for prediction.
- The algorithm is written as a class, enabling code reuse and keeping usage simple.
- Setting the logging level to DEBUG prints the detailed result of every step of tree construction.
- The decision tree is visualized using mermaid's text-based diagram format.
Libraries Used
- Python 3
- Pandas
- sklearn (only used to split the dataset)
- numpy
Implementation
Computing the Empirical Entropy
When the probabilities in the entropy formula are estimated from data (in particular, by maximum likelihood estimation), the resulting entropy is called the empirical entropy.
Empirical Entropy Formula
$$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$
```python
def empirical_entropy(self, dataset=None):
    """Compute the empirical entropy
    $$H = -\sum^n_{i=1}p(x_i)log_2(p(x_i))$$
    :return: float, the empirical entropy
    """
    if dataset is None:
        dataset = self.DataSet
    columns_count = dataset.iloc[:, -1].value_counts()
    entropy = 0
    total_count = columns_count.sum()
    for count in columns_count:
        p = count / total_count
        entropy -= p * np.log2(p)
    return entropy
```

Conditional Entropy
The conditional entropy $H(Y|X)$ measures the uncertainty of random variable Y given that random variable X is known.
It is defined as the expectation over X of the entropy of the conditional distribution of Y given X:
$$H(Y|X) = \sum_{i=1}^{n} p_i\, H(Y|X=x_i)$$
where $p_i = P(X = x_i)$.
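To make the definition concrete, here is a minimal standalone sketch; the helper names `entropy` and `conditional_entropy` are illustrative, not methods of the article's class:

```python
import numpy as np
import pandas as pd

def entropy(series: pd.Series) -> float:
    """Empirical entropy of a label column."""
    p = series.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(df: pd.DataFrame, feature: str, target: str) -> float:
    """H(Y|X): entropy of `target` within each value of `feature`,
    weighted by that value's empirical probability p_i."""
    total = len(df)
    h = 0.0
    for value, group in df.groupby(feature):
        h += len(group) / total * entropy(group[target])
    return h

# Toy data: the feature perfectly separates the labels, so H(Y|X) = 0.
df = pd.DataFrame({"x": [0, 0, 1, 1], "y": ["a", "a", "b", "b"]})
print(conditional_entropy(df, "x", "y"))  # → 0.0
```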
Information Gain
Information gain measures how much knowing the value of a feature X reduces the uncertainty about the class Y.
In other words: how much the feature helps with classification.
When the classification problem is hard — that is, when the empirical entropy of the training set is large — information gain values tend to be large, and vice versa.
The information gain ratio corrects for this bias and serves as an alternative feature-selection criterion.
The information gain g(D, A) of feature A on training set D is defined as the difference between the empirical entropy H(D) of D and the empirical conditional entropy H(D|A) of D given A:
$$g(D, A) = H(D) - H(D|A)$$
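As a worked example using the numbers from the demo run in the debug log below (14 samples, 9 positive and 5 negative; the attribute 年齡 splits them into subsets of 5, 5, and 4 rows):

$$H(D) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.9403$$

$$H(D \mid \text{年齡}) = \tfrac{5}{14}\cdot 0.9710 + \tfrac{5}{14}\cdot 0.9710 + \tfrac{4}{14}\cdot 0 \approx 0.6935$$

$$g(D, \text{年齡}) = 0.9403 - 0.6935 \approx 0.2467$$

which matches the `informationGain` values printed in the log.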
ID3
In short, ID3 repeatedly selects the attribute that helps classification the most, then, for each value of that attribute, recursively selects the next best attribute.
Choosing the Attribute with the Largest Information Gain
A smaller conditional empirical entropy (meaning the resulting partitions are purer, i.e. a larger information gain) indicates that the attribute matters more for classification.
Here, extract_dataset produces the subset of the dataset matching a given condition; it is used to compute the conditional empirical entropy and, from that, the information gain.
```python
def extract_dataset(self, dataset: pd.DataFrame, column, label):
    """Filter the dataset by column and label.
    :return: pd.DataFrame, the filtered dataset
    """
    if type(column) == int:
        split_dataset = dataset[dataset.iloc[:, column] == label].drop(dataset.columns[column], axis=1)
    else:
        split_dataset = dataset[dataset.loc[:, column] == label].drop(column, axis=1)
    return split_dataset

def best_empirical_entropy(self, dataset: pd.DataFrame = None):
    """Pick the best of the dataset's columns (largest information gain).
    :param dataset: the dataset to select from
    :return: the chosen column
    """
    if dataset is None:
        dataset = self.DataSet
    columns = dataset.columns[:-1]
    total_count = dataset.shape[0]
    empirical_entropy = self.empirical_entropy(dataset)
    logging.debug(f"now dataset shape is {dataset.shape}, column is {dataset.columns.tolist()}")
    logging.debug(f"empirical_entropy is {empirical_entropy}")
    informationGain_max = -1
    best_column = None
    for column in columns:
        entropy_tmp = 0
        data_counts = dataset.loc[:, column].value_counts()
        data_labels = data_counts.index
        logging.debug(f"now is {column}")
        for label in data_labels:
            split_dataset = self.extract_dataset(dataset, column, label)
            count = split_dataset.shape[0]
            p = count / total_count
            entropy_tmp += p * self.empirical_entropy(split_dataset)
            logging.debug(f"now label is {label}, chooseData shape is {split_dataset.shape}, "
                          f"Ans count: {split_dataset.iloc[:, -1].value_counts().tolist()}, "
                          f"entropy: {self.empirical_entropy(split_dataset)}")
        informationGain = empirical_entropy - entropy_tmp
        logging.debug(f"entropy: {entropy_tmp}, {column} informationGain:{informationGain}")
        if informationGain > informationGain_max:
            best_column = column
            informationGain_max = informationGain
    logging.debug(f"Choose {best_column}:{informationGain_max}")
    return best_column
```

Procedure
Why can a node end up with nothing left to select? Because rows that agree on every remaining attribute may still have different outcomes, a node can remain impure even after all attributes have been used up.
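The article's `id3` routine itself is part of the full source; the selection-and-recursion loop it performs can be sketched roughly as follows. This is a simplified, hypothetical version: leaves are plain values rather than the article's node dicts, the information-gain threshold is ignored, and conflicting rows are resolved by majority vote.

```python
import numpy as np
import pandas as pd

def _entropy(labels: pd.Series) -> float:
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def id3(dataset: pd.DataFrame):
    """Last column = class label. Returns a nested dict
    {"column": name, "next": {value: subtree_or_leaf, ...}} or a leaf value."""
    labels = dataset.iloc[:, -1]
    # Stop: pure node, or no attributes left (labels may still be mixed).
    if labels.nunique() == 1 or dataset.shape[1] == 1:
        return labels.mode()[0]  # majority vote resolves conflicting rows
    base = _entropy(labels)
    gains = {}
    for col in dataset.columns[:-1]:
        cond = sum(len(g) / len(dataset) * _entropy(g.iloc[:, -1])
                   for _, g in dataset.groupby(col))
        gains[col] = base - cond
    best = max(gains, key=gains.get)  # attribute with the largest information gain
    node = {"column": best, "next": {}}
    for value, group in dataset.groupby(best):
        node["next"][value] = id3(group.drop(columns=best))
    return node

demo = pd.DataFrame({"work": ["y", "y", "n", "n"], "cls": ["yes", "yes", "no", "no"]})
print(id3(demo))  # → {'column': 'work', 'next': {'n': 'no', 'y': 'yes'}}
```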
Fitting

```python
def fit(self, x: pd.DataFrame, y=None, algorithm: str = "id3", threshold=0.1):
    """Fit on a dataset. If y is not supplied, the last column of x must
    contain the classification result.
    :param x: pd.DataFrame, the attributes (the full dataset, including the
        result column, when y is None)
    :param y: list-like, shape=(-1,), the classification results
    :param algorithm: which algorithm to use (currently only ID3)
    :param threshold: the information-gain threshold
    :return: the root node of the decision tree
    """
    self.check_dataset(x, dimension=2)
    self.check_dataset(y, dimension=1)
    self._threshold = threshold
    dataset = x
    if y is not None:
        dataset.insert(dataset.shape[1], 'DECISION_tempADD', y)
    # getattr(self, algorithm) would be a safer alternative to eval here
    self.decision_tree = eval("self." + algorithm)(dataset)
    logging.info(f"decision_tree leaf:{self._leafCount}")
    return self.decision_tree
```

Prediction
```python
def predict(self, x: pd.DataFrame):
    """Predict on a dataset.
    :param x: pd.DataFrame, the input dataset
    :return: the classification results
    """
    self.y_predict = x.apply(self._predict_line, axis=1)
    return self.y_predict

def _predict_line(self, line):
    """Private helper used by predict to classify a single row.
    :param line: one row of the input dataset
    :return: the classification result for that row
    """
    tree = self.decision_tree
    while True:
        try:
            if len(tree["next"]) == 1:
                return tree["next"]["其他"]
            else:
                value = line[tree["column"]]
                tree = tree["next"][value]
        except KeyError:
            # unseen value: fall back to the "其他" ("other") branch
            return tree["next"]["其他"]
```

Evaluation
Evaluate the accuracy, precision, and recall of the results.
- The score function only supports binary classification; it does not apply to multi-class problems (though the decision tree itself can still predict them).
- score also assumes positive examples are labeled 1 and negative examples 0.
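The metrics under those constraints can be sketched as below; this is a hypothetical standalone version, not the article's score method:

```python
def score(y_true, y_pred):
    """Binary-only evaluation (positive class = 1, negative = 0).
    Returns (accuracy, precision, recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

print(score([1, 1, 0, 0], [1, 0, 1, 0]))  # → (0.5, 0.5, 0.5)
```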
Decision Tree Visualization
The tree is drawn with mermaid's text-based diagram format. Predicted values are merged: when different values of the same attribute lead to the same classification, they all point to a single output node in the visualization.
- The visualization function offers two output formats:
  - markdown
  - html (recommended — open it in a browser to view the tree)
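The merging idea can be sketched as a small emitter over a simplified tree structure (leaves as plain values; the function name and node shapes are illustrative, not the article's implementation):

```python
def to_mermaid(tree: dict) -> str:
    """Emit mermaid flowchart text for a nested tree of the form
    {"column": name, "next": {value: subtree_or_leaf, ...}}.
    Edges that reach the same leaf result share one output node."""
    lines = ["graph TD"]
    counter = [0]

    def new_id() -> str:
        counter[0] += 1
        return f"n{counter[0]}"

    def walk(node: dict, nid: str) -> None:
        lines.append(f'{nid}["{node["column"]}"]')
        leaf_ids = {}  # one shared output node per distinct leaf result
        for value, child in node["next"].items():
            if isinstance(child, dict):
                cid = new_id()
                lines.append(f"{nid} -->|{value}| {cid}")
                walk(child, cid)
            else:
                if child not in leaf_ids:
                    leaf_ids[child] = new_id()
                    lines.append(f'{leaf_ids[child]}(["{child}"])')
                lines.append(f"{nid} -->|{value}| {leaf_ids[child]}")

    walk(tree, "n0")
    return "\n".join(lines)

tree = {"column": "age", "next": {0: "yes", 1: "yes", 2: "no"}}
print(to_mermaid(tree))
```

Here the values 0 and 1 both classify as "yes", so their edges point at the same output node.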
Saving the Decision Tree

```python
def save(self, savePath: str):
    # use self.decision_tree (the original referenced a global instance)
    with open(savePath, "w") as f:
        f.write(str(self.decision_tree))
    logging.info(f"Decision tree saved to: {savePath}")
```

Loading the Decision Tree
```python
def load(self, savePath: str):
    with open(savePath, "r") as f:
        # ast.literal_eval would be safer than eval for untrusted files
        tree = eval(f.read())
    if type(tree) == dict:
        self.decision_tree = tree
    else:
        raise Exception("Load failed!")
```

Rendered Output
A sample diagram, not the classification result for this dataset.
Full Source Code
How to Get Each Intermediate Result
If you don't want that much detail, set `level` in the `logging.basicConfig` call at the top of the file to INFO.
That is, change:

```python
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
```

to:

```python
logging.basicConfig(level=logging.INFO, format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
```
To write the log to a file:
the `filename` parameter sets the log file path, and `filemode` sets the file write mode.

```python
logging.basicConfig(level=logging.DEBUG, filename='DecisionTree.log', filemode='w', format="%(asctime)s-[%(name)s]\t[%(levelname)s]\t[%(funcName)s]: %(message)s")
```
Possible issues when running the code
- Wrong dataset format: the Caesarian Section Classification Dataset downloads as an arff file, while this code expects csv. Extract the data section from the arff file — for example, open it in a text editor and save the data portion as a csv file.
- The code also ships with a demo that runs without any external dataset.
- The score function only supports binary classification (the tree can still predict multi-class data), and it assumes positives are labeled 1 and negatives 0.
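The arff-to-csv step can also be scripted. A minimal sketch, assuming a simple ARFF layout (header lines, then an `@data` marker followed by comma-separated rows); the function name and the `header` argument are illustrative:

```python
def arff_data_to_csv(arff_path: str, csv_path: str, header: str) -> None:
    """Copy the data section of an ARFF file into a CSV file,
    prepending the given CSV header line."""
    with open(arff_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    # everything after the "@data" marker is the comma-separated data block
    start = next(i for i, l in enumerate(lines) if l.strip().lower() == "@data")
    # skip blank lines and % comments
    rows = [l.strip() for l in lines[start + 1:]
            if l.strip() and not l.lstrip().startswith("%")]
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write(header + "\n" + "\n".join(rows) + "\n")
```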
Experimental Results (Decision Tree)
Debug Mode
Run on the demo dataset:
```
2020-10-14 00:47:19,827-[root] [DEBUG] [best_empirical_entropy]: now dataset shape is (14, 5), column is ['年齡', '有工作', '是學生', '信貸情況', 'DECISION_tempADD']
2020-10-14 00:47:19,827-[root] [DEBUG] [best_empirical_entropy]: empirical_entropy is 0.9402859586706311
2020-10-14 00:47:19,831-[root] [DEBUG] [best_empirical_entropy]: now is 年齡
2020-10-14 00:47:19,849-[root] [DEBUG] [best_empirical_entropy]: now label is 2, chooseData shape is (5, 4), Ans count: [3, 2], entropy: 0.9709505944546686
2020-10-14 00:47:19,859-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (5, 4), Ans count: [3, 2], entropy: 0.9709505944546686
2020-10-14 00:47:19,865-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (4, 4), Ans count: [4], entropy: 0.0
2020-10-14 00:47:19,865-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.6935361388961918, 年齡 informationGain:0.24674981977443933
2020-10-14 00:47:19,868-[root] [DEBUG] [best_empirical_entropy]: now is 有工作
2020-10-14 00:47:19,880-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (6, 4), Ans count: [4, 2], entropy: 0.9182958340544896
2020-10-14 00:47:19,889-[root] [DEBUG] [best_empirical_entropy]: now label is 2, chooseData shape is (4, 4), Ans count: [2, 2], entropy: 1.0
2020-10-14 00:47:19,896-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (4, 4), Ans count: [3, 1], entropy: 0.8112781244591328
2020-10-14 00:47:19,897-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9110633930116763, 有工作 informationGain:0.02922256565895487
2020-10-14 00:47:19,898-[root] [DEBUG] [best_empirical_entropy]: now is 是學生
2020-10-14 00:47:19,909-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (7, 4), Ans count: [6, 1], entropy: 0.5916727785823275
2020-10-14 00:47:19,917-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (7, 4), Ans count: [4, 3], entropy: 0.9852281360342515
2020-10-14 00:47:19,918-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.7884504573082896, 是學生 informationGain:0.15183550136234159
2020-10-14 00:47:19,920-[root] [DEBUG] [best_empirical_entropy]: now is 信貸情況
2020-10-14 00:47:19,927-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (8, 4), Ans count: [6, 2], entropy: 0.8112781244591328
2020-10-14 00:47:19,937-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (6, 4), Ans count: [3, 3], entropy: 1.0
2020-10-14 00:47:19,937-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.8921589282623617, 信貸情況 informationGain:0.04812703040826949
2020-10-14 00:47:19,937-[root] [DEBUG] [best_empirical_entropy]: Choose 年齡:0.24674981977443933
2020-10-14 00:47:19,940-[root] [DEBUG] [id3]: now choose_column:年齡, label: 2
2020-10-14 00:47:19,950-[root] [DEBUG] [best_empirical_entropy]: now dataset shape is (5, 4), column is ['有工作', '是學生', '信貸情況', 'DECISION_tempADD']
2020-10-14 00:47:19,950-[root] [DEBUG] [best_empirical_entropy]: empirical_entropy is 0.9709505944546686
2020-10-14 00:47:19,953-[root] [DEBUG] [best_empirical_entropy]: now is 有工作
2020-10-14 00:47:19,964-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:19,974-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:19,974-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9509775004326937, 有工作 informationGain:0.01997309402197489
2020-10-14 00:47:19,976-[root] [DEBUG] [best_empirical_entropy]: now is 是學生
2020-10-14 00:47:19,983-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:19,992-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:19,992-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9509775004326937, 是學生 informationGain:0.01997309402197489
2020-10-14 00:47:19,995-[root] [DEBUG] [best_empirical_entropy]: now is 信貸情況
2020-10-14 00:47:20,004-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [3], entropy: 0.0
2020-10-14 00:47:20,013-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,013-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.0, 信貸情況 informationGain:0.9709505944546686
2020-10-14 00:47:20,013-[root] [DEBUG] [best_empirical_entropy]: Choose 信貸情況:0.9709505944546686
2020-10-14 00:47:20,015-[root] [DEBUG] [id3]: now choose_column:信貸情況, label: 0
2020-10-14 00:47:20,021-[root] [DEBUG] [id3]: select decision 1, result_type:[3], dataset column:(3, 3), lower than threshold:False
2020-10-14 00:47:20,021-[root] [DEBUG] [id3]: now choose_column:信貸情況, label: 1
2020-10-14 00:47:20,027-[root] [DEBUG] [id3]: select decision 0, result_type:[2], dataset column:(2, 3), lower than threshold:False
2020-10-14 00:47:20,028-[root] [DEBUG] [id3]: now choose_column:年齡, label: 0
2020-10-14 00:47:20,037-[root] [DEBUG] [best_empirical_entropy]: now dataset shape is (5, 4), column is ['有工作', '是學生', '信貸情況', 'DECISION_tempADD']
2020-10-14 00:47:20,037-[root] [DEBUG] [best_empirical_entropy]: empirical_entropy is 0.9709505944546686
2020-10-14 00:47:20,038-[root] [DEBUG] [best_empirical_entropy]: now is 有工作
2020-10-14 00:47:20,046-[root] [DEBUG] [best_empirical_entropy]: now label is 2, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,052-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:20,060-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (1, 3), Ans count: [1], entropy: 0.0
2020-10-14 00:47:20,060-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.4, 有工作 informationGain:0.5709505944546686
2020-10-14 00:47:20,061-[root] [DEBUG] [best_empirical_entropy]: now is 是學生
2020-10-14 00:47:20,068-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [3], entropy: 0.0
2020-10-14 00:47:20,076-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [2], entropy: 0.0
2020-10-14 00:47:20,076-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.0, 是學生 informationGain:0.9709505944546686
2020-10-14 00:47:20,077-[root] [DEBUG] [best_empirical_entropy]: now is 信貸情況
2020-10-14 00:47:20,085-[root] [DEBUG] [best_empirical_entropy]: now label is 0, chooseData shape is (3, 3), Ans count: [2, 1], entropy: 0.9182958340544896
2020-10-14 00:47:20,092-[root] [DEBUG] [best_empirical_entropy]: now label is 1, chooseData shape is (2, 3), Ans count: [1, 1], entropy: 1.0
2020-10-14 00:47:20,092-[root] [DEBUG] [best_empirical_entropy]: entropy: 0.9509775004326937, 信貸情況 informationGain:0.01997309402197489
2020-10-14 00:47:20,092-[root] [DEBUG] [best_empirical_entropy]: Choose 是學生:0.9709505944546686
2020-10-14 00:47:20,094-[root] [DEBUG] [id3]: now choose_column:是學生, label: 0
2020-10-14 00:47:20,100-[root] [DEBUG] [id3]: select decision 0, result_type:[3], dataset column:(3, 3), lower than threshold:False
2020-10-14 00:47:20,100-[root] [DEBUG] [id3]: now choose_column:是學生, label: 1
2020-10-14 00:47:20,106-[root] [DEBUG] [id3]: select decision 1, result_type:[2], dataset column:(2, 3), lower than threshold:False
2020-10-14 00:47:20,106-[root] [DEBUG] [id3]: now choose_column:年齡, label: 1
2020-10-14 00:47:20,112-[root] [DEBUG] [id3]: select decision 1, result_type:[4], dataset column:(4, 4), lower than threshold:False
2020-10-14 00:47:20,112-[root] [INFO] [fit]: decision_tree leaf:5
2020-10-14 00:47:20,113-[root] [INFO] [save]: 決策樹已保存,位置:decisionTree.txt
2020-10-14 00:47:20,123-[root] [DEBUG] [score]: y_acutalTrue:9, y_acutalFalse:5, y_predictTrue:9, y_true:9, y_total:14
```

Summary