Machine Learning Basics: Implementing a Basic Decision Tree
1. The Basic Decision-Tree Procedure
Decision trees are a common machine learning method. While working through Zhou Zhihua's *Machine Learning*, I implemented the most basic ID3 algorithm in Python. Its core idea comes from the information-gain concept in information theory: first find the attribute (feature) with the highest information gain for splitting the samples, then recursively apply the same choose-the-highest-gain-attribute-and-split step to each subset produced by that split. This is a classic divide-and-conquer process.
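Before the full implementation, the two quantities the algorithm relies on can be sketched in a few lines. This is a minimal illustration on a made-up toy sample (not the watermelon data), using the same '是'/'否' labels as below:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, attr_values):
    """Information gain of splitting `labels` by the parallel list `attr_values`."""
    n = len(labels)
    groups = {}
    for v, y in zip(attr_values, labels):
        groups.setdefault(v, []).append(y)
    # conditional entropy: size-weighted entropy of each subset
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond

# toy example: the attribute splits the samples into two pure subsets
labels = ['是', '是', '否', '否']
attr = ['青綠', '青綠', '烏黑', '烏黑']
print(info_gain(labels, attr))  # 1.0: the split removes all uncertainty
```

ID3 simply evaluates `info_gain` for every remaining attribute and splits on the largest.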
2. A Minimal Decision-Tree Implementation
Below, using the example from Chapter 4 of Zhou Zhihua's *Machine Learning* (the "watermelon book"), I implemented a decision tree in Python on top of a pandas DataFrame, with all attributes discrete. For continuous attributes, or the more general case, you can call sklearn's decision-tree package instead; that version comes at the end. The code:
```python
import os
from math import log2

import pandas as pd


def Entrop(data):
    # entropy of each subset; `data` maps attribute value -> concatenated label string
    entropy = dict()
    for i in data.keys():
        good = data[i].count('是')
        bad = data[i].count('否')
        length = len(data[i])
        p1 = good / length
        p2 = bad / length
        if p1 == 0.0 or p2 == 0.0:
            entropy[i] = 0
        else:
            entropy[i] = -(p1 * log2(p1) + p2 * log2(p2))
    return entropy


def DecisionTree(entropy, data, MaxKey, threshold, category):
    # recursively split each branch of MaxKey while its entropy exceeds the threshold
    for i in entropy:
        sub_data = data[data[MaxKey] == i]
        subs_entropy = entropy[i]
        sizes = len(sub_data)
        if subs_entropy >= threshold:
            gains = dict()
            data_sample = dict()
            Ent = dict()
            for j in category:
                # grouping the label column and summing concatenates the '是'/'否' strings
                data_sample[j] = sub_data['好瓜'].groupby(sub_data[j]).sum()
                nn = len(data_sample[j])
                he = 0
                for m in range(nn):
                    good = data_sample[j][m].count('是')
                    bad = data_sample[j][m].count('否')
                    length = len(data_sample[j][m])
                    p1 = good / length
                    p2 = bad / length
                    if good == 0 or bad == 0 or length == 0:
                        Ent[j] = 0
                    else:
                        Ent[j] = -(p1 * log2(p1) + p2 * log2(p2))
                    he += (length * Ent[j]) / sizes
                gains[j] = subs_entropy - he
            if len(gains) > 0:
                maxKey = max(gains.items(), key=lambda x: x[1])[0]
                entropys = Entrop(data_sample[maxKey])
                category.pop(maxKey)
                print('The next split under {0} is {1}'.format(i, maxKey))
                DecisionTree(entropys, sub_data, maxKey, threshold, category)
            else:
                # no attributes left: label the leaf by majority class
                highest_class = max(sub_data['好瓜'].values)
                if highest_class == '否':
                    print('All melons under {0} are bad'.format(i))
                else:
                    print('All melons under {0} are good'.format(i))
        else:
            # branch is (nearly) pure: make it a leaf
            highest_class = max(sub_data['好瓜'].values)
            if highest_class == '否':
                print('All melons under {0} are bad'.format(i))
            else:
                print('All melons under {0} are good'.format(i))


def main():
    dataset_path = './datasets'
    filename = 'watermalon_simple.xlsx'
    dataset_filepath = os.path.join(dataset_path, filename)

    data = pd.read_excel(dataset_filepath)
    header = data.columns
    sums = len(data)

    # candidate attributes: every column except the id and the label
    category = dict()
    for i in header[1:-1]:
        category[i] = set(data[i])

    # entropy of the whole dataset
    Ent = dict()
    number = data['好瓜'].groupby(data['好瓜']).sum()
    p1 = len(number['是']) / sums
    p2 = len(number['否']) / sums
    Ent['總'] = -(p1 * log2(p1) + p2 * log2(p2))

    # information gain of each attribute
    Gain = dict()
    data_sample = dict()
    for i in category:
        data_sample[i] = data['好瓜'].groupby(data[i]).sum()
        n = category[i]
        he = 0
        for j in range(len(n)):
            good = data_sample[i][j].count('是')
            bad = data_sample[i][j].count('否')
            length = len(data_sample[i][j])
            p1 = good / length
            p2 = bad / length
            if p1 == 0.0 or p2 == 0.0:
                Ent[j] = 0
            else:
                Ent[j] = -(p1 * log2(p1) + p2 * log2(p2))
            he += (length * Ent[j]) / sums
        Gain[i] = Ent['總'] - he

    MaxKey = max(Gain.items(), key=lambda x: x[1])[0]
    print('The root split is {}'.format(MaxKey))

    entropy = Entrop(data_sample[MaxKey])
    print(entropy)

    category.pop(str(MaxKey))

    threshold = 0.05
    DecisionTree(entropy, data, MaxKey, threshold, category)


if __name__ == "__main__":
    main()
```

That is my code; feel free to reach out if anything is unclear. Below is the sklearn version:
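One non-obvious trick in the code above is worth isolating: grouping the string label column `好瓜` and calling `.sum()` concatenates the `'是'`/`'否'` strings within each group, so `str.count('是')` then yields the number of positive samples per attribute value. A minimal sketch with made-up values and the same column names:

```python
import pandas as pd

# toy frame mimicking the watermelon data: one attribute plus the 好瓜 label
df = pd.DataFrame({'色澤': ['青綠', '青綠', '烏黑'],
                   '好瓜': ['是', '否', '是']})

# summing strings per group concatenates them, e.g. the 青綠 group -> '是否'
grouped = df['好瓜'].groupby(df['色澤']).sum()
print(grouped['青綠'])              # '是否'
print(grouped['青綠'].count('是'))  # 1 positive sample under 青綠
```

It works, but a `value_counts()` per group would express the same idea more directly and without relying on string concatenation.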
```python
import numpy as np
import xlrd
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer
# sklearn.cross_validation was removed in later releases; use model_selection
from sklearn.model_selection import train_test_split

workbook = xlrd.open_workbook('watermalon_simple.xlsx')
booksheet = workbook.sheet_by_name('Sheet1')
X = list()
Y = list()
cat = ['色澤', '根蒂', '敲聲', '紋理', '觸感']
for row in range(booksheet.nrows):
    ll = dict()
    mm = list()
    for col in range(1, booksheet.ncols - 1):
        val = booksheet.cell(row, col).value
        ll[cat[col - 1]] = val
    for col in range(booksheet.ncols - 1, booksheet.ncols):
        val = booksheet.cell(row, col).value
        mm.append(val)
    print('Row {} read'.format(row))
    X.append(ll)
    Y.append(mm)

X.pop(0); Y.pop(0)  # drop the header row
print(X); print(Y)

# labels encoded by hand: 1 = good melon, 0 = bad melon
y = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
dummyY = np.array(y)

# one-hot encode the categorical attributes
vec = DictVectorizer()
dummyX = vec.fit_transform(X).toarray()

x_train, x_test, y_train, y_test = train_test_split(dummyX, dummyY, test_size=0.3)

clf = tree.DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(x_train, y_train)

answer = clf.predict(x_test)
print(u'Running on the test split')
print(answer)
print(y_test)

# export the tree for Graphviz (get_feature_names_out in newer scikit-learn)
with open("alalala.dot", "w") as f:
    f = tree.export_graphviz(clf, feature_names=vec.get_feature_names(), out_file=f)
```