Classifying News Data with AdaBoost
This post uses AdaBoost, building weak classifiers from single-level decision-tree stumps of the form (feature, threshold, positive/negative).
The maximum number of boosting rounds is set to 20; in the end only 16 rounds were needed to bring the total training error down to 0, yielding a classifier with 100% training accuracy.
The corpus covers three news categories: business, sports, and auto.
Since the AdaBoost written here is a two-class model, the corpus is split into business and non-business, labeled 1 and -1 respectively. Features come straight from the word-segmentation results, keeping only words of two or more characters; the threshold is likewise just a 0/1 test of whether the word occurs in the document, as illustrated in the sketch below.
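As a concrete illustration of this encoding (a minimal sketch; the tuple layout follows the (feature, threshold, polarity) form described above, and the sample document is made up):

```python
# A stump is (feature_word, threshold, polarity). With presence features,
# the threshold is 0 or 1: "does the word appear in the document?"
def stump_predict(stump, doc_words):
    feature, threshold, polarity = stump
    exist = 1 if feature in doc_words else 0
    # Predict +polarity when the presence bit matches the threshold,
    # -polarity otherwise.
    return polarity if exist == threshold else -polarity

# E.g. a stump ('體育', 0, 1) predicts +1 (business) for documents that
# do NOT contain the word 體育 ('sports'), and -1 otherwise.
doc = set(['財經', '股票', '銀行'])
assert stump_predict(('體育', 0, 1), doc) == 1
```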
The final results are shown below. Note that the selected features repeat: the feature 體育 ('sports'), for example, was picked several times, with an identical stump each time. The overall classification accuracy also fluctuates back and forth over the course of training.
Judging from the selected feature words, the earlier picks look fairly sensible, but the later ones less so, e.g. 編輯 ('editor'), 第一 ('first'), and 來源 ('source'), which suggests the model overfits to some degree.
```
result stumplist is: [(2404, 0, -1), (32590, 0, -1), (19569, 0, 1), (12171, 0, 1), (29965, 0, -1), (15667, 0, 1), (12171, 0, 1), (32687, 0, -1), (25944, 0, 1), (12171, 0, 1), (32890, 0, -1), (2404, 0, -1), (4840, 0, -1), (15667, 0, 1), (9642, 0, 1), (8630, 0, -1)]
result features are:
財經 股票 汽車 體育 銀行 編輯 體育 作者 責任 體育 教育 財經 指出 編輯 第一 來源
```

The output while training a single stump looks like this:
```
-------------------- Train stump round 9 --------------------
>>train featureindex is 0
get new min stump (0, 0, -1) feature is 石塊 , error is 0.408032946166
get new min stump (1, 0, -1) feature is 基建 , error is 0.402227800978
get new min stump (12, 0, -1) feature is 律師 , error is 0.397832761677
get new min stump (25, 0, -1) feature is 合理 , error is 0.385203075673
get new min stump (108, 0, -1) feature is 首席 , error is 0.382699443012
get new min stump (316, 0, -1) feature is 證券 , error is 0.344348559772
>>train featureindex is 1000
>>train featureindex is 2000
get new min stump (2258, 0, -1) feature is 政府 , error is 0.341484530448
get new min stump (2404, 0, -1) feature is 財經 , error is 0.27536370978
>>train featureindex is 3000
>>train featureindex is 4000
>>train featureindex is 5000
>>train featureindex is 6000
>>train featureindex is 7000
>>train featureindex is 8000
>>train featureindex is 9000
>>train featureindex is 10000
>>train featureindex is 11000
>>train featureindex is 12000
get new min stump (12171, 0, 1) feature is 體育 , error is 0.261865947444
>>train featureindex is 13000
>>train featureindex is 14000
>>train featureindex is 15000
>>train featureindex is 16000
>>train featureindex is 17000
>>train featureindex is 18000
>>train featureindex is 19000
>>train featureindex is 20000
>>train featureindex is 21000
>>train featureindex is 22000
>>train featureindex is 23000
>>train featureindex is 24000
>>train featureindex is 25000
>>train featureindex is 26000
>>train featureindex is 27000
>>train featureindex is 28000
>>train featureindex is 29000
>>train featureindex is 30000
>>train featureindex is 31000
>>train featureindex is 32000
>>train featureindex is 33000
>>train featureindex is 34000
>>train featureindex is 35000
>>train featureindex is 36000
>>train featureindex is 37000
>>train featureindex is 38000
this round stump is (12171, 0, 1)
totallabelerror is 0.00293542074364
```

One thing to watch out for when writing the AdaBoost program: within each round, candidate stumps are trained against the new weightlist, and each stump's weightlist-weighted overall error is what you use to pick the best stump of the round. Once the round's best stump has been chosen (i.e., just before moving on to the next round), you compute the classification error of the whole AdaBoost model: add the current stump's predictions, weighted by its cm value (called alpha in some materials), to the cm-weighted predictions of all previous stumps, and classify with the sign function. (The accumulated prediction can be kept between rounds, so each round you only add the current stump's contribution instead of recomputing from scratch.) That gives the overall classification accuracy. These two error rates are different and serve different purposes, one for selecting the best stump and one for training the overall AdaBoost model: the former only needs weighting by the new weightlist, while the latter must account for the predictions of all stumps found so far. I originally mixed the two up, which is why my predictions kept coming out wrong.
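To make the distinction concrete, here is a minimal sketch of one training round (the function and variable names are illustrative, not taken from the code below). The weighted error drives stump selection and the cm weight, cm = ln((1 - err) / err), while the plain unweighted error of the sign of the accumulated prediction monitors the ensemble:

```python
import numpy as np

def train_round(candidates, true_labels, weights, total_predict):
    """One AdaBoost round. `candidates` yields (stump, predictions) pairs,
    where predictions and true_labels are +/-1 numpy arrays."""
    # 1) Stump selection uses the *weighted* error under the current weights.
    stump, pred = min(candidates,
                      key=lambda c: np.dot(weights, c[1] != true_labels))
    err = np.dot(weights, pred != true_labels)

    # 2) cm (alpha) from the weighted error: cm = ln((1 - err) / err).
    err = max(err, 1e-16)
    cm = np.log((1.0 - err) / err)

    # 3) The ensemble error is the *unweighted* error of the sign of the
    #    accumulated cm-weighted predictions of all stumps chosen so far.
    total_predict = total_predict + cm * pred
    ensemble_error = np.mean(np.sign(total_predict) != true_labels)

    # 4) Re-weight: misclassified samples get multiplied by exp(cm).
    weights = weights * np.exp(cm * (pred != true_labels))
    weights = weights / weights.sum()
    return stump, cm, weights, total_predict, ensemble_error
```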
The Python version of the AdaBoost code is as follows:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from numpy import *
import os

def getwordset(doclist):
    wordset = set()
    docwordset = []
    for doc in doclist:
        f = open(os.path.join(os.getcwd(), DIR, doc))
        content = f.read()
        words = content.split(' ')
        # On UTF-8 byte strings this keeps words of two or more Chinese characters.
        words = [word.strip() for word in words if len(word) > 4]
        wordset |= set(words)
        docwordset.append(list(set(words)))
        f.close()
    return list(wordset), docwordset

def savefeaturelist(featurelist):
    f = open('featurelist', 'w')
    for feature in featurelist:
        f.write(feature + ' ')
    f.close()

def classifydoclist(stump, doclist, weightlist):
    labellist = []
    for i in range(len(doclist)):
        exist = 0
        words = docfeaturelist[i]
        if featurelist[stump[0]] in words: # Notice: 'feature in words', NOT 'featureindex in words'
            exist = 1
        # append classify doc label
        if exist == stump[1]:
            labellist.append(stump[2])
        else:
            labellist.append(-1 * stump[2])
    return labellist

def trainstump(doclist, featurelist, weightlist):
    # stump is (featureindex, threshold, positive/negative)
    minstump = (0, 0, 0)  # placeholder; the first element is a feature index
    minerror = 1.0
    minlabellist = []
    for featureindex in range(len(featurelist)):
        if featureindex % 1000 == 0:
            print '>>train featureindex is', featureindex
        for threshold in [0, 1]:
            for symbol in [-1, 1]:
                stump = (featureindex, threshold, symbol)
                labellist = classifydoclist(stump, doclist, weightlist)
                # weighted error: total weight on misclassified docs
                error = float(abs(array(doclabellist) - array(labellist)) / 2 * mat(weightlist).T)
                if error < minerror:
                    minstump = stump
                    minerror = error
                    minlabellist = labellist
                    print 'get new min stump', stump, 'feature is', featurelist[minstump[0]], ', error is', error
        if minerror == 0.0:
            break
    return minstump, minerror, minlabellist

def getcm(error):
    error = max(error, 1e-16)
    return log((1.0 - error) / error) # some formulations multiply this by 0.5

def updateweightlist(weightlist, cm, labellist):
    # minus[i] is 1 for a misclassified doc, 0 otherwise
    minus = getclassifydiff(labellist)
    weightlist = [weightlist[i] * exp(cm * minus[i]) for i in range(len(weightlist))]
    weightlist = [weight / sum(weightlist) for weight in weightlist]
    return weightlist

def sign(plist):
    result = [-1 for i in range(len(plist))]
    for i in range(len(plist)):
        if plist[i] > 0:
            result[i] = 1
    return result

def getclassifydiff(plabellist):
    return list(abs(array(doclabellist) - array(plabellist)) / 2)

def getclassifyerror(plabellist):
    minus = getclassifydiff(plabellist)
    return 1.0 * minus.count(1) / len(plabellist)

def traindata(doclist, featurelist):
    stumplist = []
    cmlist = []
    max_k = 20
    totallabelpredict = array([0.0 for i in range(len(doclabellist))])
    weightlist = [1.0 / len(doclist) for i in range(len(doclist))]
    for i in range(max_k):
        print '\n', '-' * 20, 'Train stump round', i, '-' * 20
        stump, error, labellist = trainstump(doclist, featurelist, weightlist)
        print 'this round stump is', stump
        cm = getcm(error)
        stumplist.append(stump)
        cmlist.append(cm)
        # check the total predict result error
        totallabelpredict += cm * array(labellist)
        totallabelerror = getclassifyerror(sign(totallabelpredict))
        print 'totallabelerror is', totallabelerror
        if totallabelerror == 0.0:
            break
        # update weight list
        weightlist = updateweightlist(weightlist, cm, labellist)
    print '\n\nTrain data done!'
    # save model to file
    model = open('Adaboostmodel', 'w')
    model.write('cmlist:\n')
    model.write(str(cmlist) + '\n')
    model.write('stumplist:\n')
    model.write(str(stumplist) + '\n')
    model.write('stump features are:\n')
    print 'result stumplist is:', stumplist
    print 'result features are:'
    for s in stumplist:
        print featurelist[s[0]]
        model.write(str(featurelist[s[0]]) + '\n')
    model.close()
    print 'save model to file done!'

def getdoclabellist(doclist):
    '''sports is -1, business is 1. Two-class classification (business vs. not-business).'''
    labellist = [-1 for i in range(len(doclist))]
    for i in range(len(doclist)):
        if 'business' in doclist[i]:
            labellist[i] = 1
    return labellist

def adaboost():
    global DIR
    global doclist, featurelist, docfeaturelist, doclabellist
    DIR = 'news'
    print 'Arthur adaboost test begin...'
    print 'doc path DIR is:', DIR
    doclist = os.listdir(os.path.join(os.getcwd(), DIR))
    doclist.sort()
    print 'total doc size:', len(doclist)
    # Get the real doc labels; the stumps are trained against this list!
    doclabellist = getdoclabellist(doclist)
    featurelist, docfeaturelist = getwordset(doclist)
    print 'total feature size:', len(featurelist)
    # train data to get stumps
    traindata(doclist, featurelist)

if __name__ == '__main__':
    adaboost()
```

Once the AdaBoost stumps are trained, prediction simply uses each stump's cm value as its weight and applies the sign function to the overall weighted prediction.
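The prediction routine itself is not shown above, but following that description it might look like this (a sketch; `predictdoc` is a hypothetical name, and `docwords`, `stumplist`, `cmlist`, and `featurelist` follow the training code's conventions):

```python
def predictdoc(docwords, stumplist, cmlist, featurelist):
    # Sum each stump's +/-1 vote, weighted by its cm value,
    # then take the sign of the total.
    total = 0.0
    for stump, cm in zip(stumplist, cmlist):
        featureindex, threshold, symbol = stump
        exist = 1 if featurelist[featureindex] in docwords else 0
        vote = symbol if exist == threshold else -symbol
        total += cm * vote
    return 1 if total > 0 else -1
```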