當(dāng)前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

机器学习算法-Adaboost

發(fā)布時間：2025/3/17 编程问答 25 豆豆

生活随笔收集整理的這篇文章主要介紹了机器学习算法-Adaboost 小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

本章內(nèi)容

組合類似的分類器來提高分類性能
應(yīng)用AdaBoost算法
處理非均衡分類問題

主題：利用AdaBoost元算法提高分類性能

1.基于數(shù)據(jù)集多重抽樣的分類器

-AdaBoost

長處	泛化錯誤率低，易編碼，能夠應(yīng)用在大部分分類器上，無需參數(shù)調(diào)整
缺點	對離群點敏感
適合數(shù)據(jù)類型	數(shù)值型和標(biāo)稱型數(shù)據(jù)

bagging:基于數(shù)據(jù)隨機重抽樣的分類器構(gòu)建方法

自舉匯聚法(bootstrap aggregating),也稱為bagging方法，是從原始數(shù)據(jù)集選擇S次后得到S個新數(shù)據(jù)集的一種技術(shù)。

新數(shù)據(jù)集和原始數(shù)據(jù)集的大小相等。每一個數(shù)據(jù)集都是通過在原始數(shù)據(jù)集中隨機選擇一個本來進行替換而得到的。

在S個數(shù)據(jù)集建好之后，將某個學(xué)習(xí)算法分別作用域每一個數(shù)據(jù)集得到了S個分類器。當(dāng)我們對新數(shù)據(jù)進行分類時，就能夠應(yīng)用S個分類器進行分類。與此同一時候，選擇分類器投票結(jié)果最多的類別作為最后的分類結(jié)果。

有一些比較先進的bagging方法，如隨機森林（RF）。

boosting是一種與bagging非常類似的技術(shù)。

不論是boosting還是bagging其中。當(dāng)使用的多個分類器的類型都是一致的。可是在前者其中，不同的分類器是通過串行訓(xùn)練而獲得的。每一個新分類器都依據(jù)已訓(xùn)練出的分類器的性能來進行訓(xùn)練。boosting是通過訓(xùn)練集中關(guān)注被已有分類器錯分的那些數(shù)據(jù)來獲得新的分類器。

boosting方法有多個版本號，當(dāng)前最流行便屬于AdaBoost。

AdaBoost的一般流程

（1）收集數(shù)據(jù)：能夠使用不論什么方法；
（2）準(zhǔn)備數(shù)據(jù)：依賴于所使用的若分類器類型；
（3）分析數(shù)據(jù)：能夠使用隨意方法
（4）訓(xùn)練算法：AdaBoost的大部分時間都用在訓(xùn)練上，分類器將多次在同一數(shù)據(jù)集上訓(xùn)練若分類器。
（5）測試算法：計算分類的錯誤率；
（6）使用算法：同SVM一樣，AdaBoost預(yù)測的兩個類別中的一個。假設(shè)想要把它應(yīng)用到多個類的場合，那么就像多類SVM中的做法一樣對AdaBoost進行改動。

2.訓(xùn)練算法：基于錯誤提升分類器的性能

AdaBoost是adaptive boosting（自適應(yīng)boosting）的縮寫，其執(zhí)行過程：訓(xùn)練集中的每一個樣本，賦予其一個權(quán)重，這些權(quán)重構(gòu)成向量D。一開始，這些權(quán)重都初試化成相等值。首先在訓(xùn)練數(shù)據(jù)上訓(xùn)練處一個若分類器并計算該分類器的錯誤率，然后在同一數(shù)據(jù)集上再次訓(xùn)練若分類器。在分類器的第二次訓(xùn)練其中，將會又一次調(diào)整每一個樣本的權(quán)重。其中第一次分隊的樣本的權(quán)重值將會減少。而第一次分錯的樣本的權(quán)重將會提高。

為了從全部分類器中得到終于的分類結(jié)果，AdaBoost為每一個分類器都分配了一個權(quán)重值alpha，這些alpha值是基于每一個分類器的錯誤率進行計算的。其中錯誤率定義為

?=為正確分類的樣本數(shù)目所有樣本數(shù)目

alpha計算公式

α=12ln(1???)

計算出alpha值之后，能夠?qū)?quán)重向量D進行更新，使得正確分類的樣本的權(quán)重值減少而分錯的樣本權(quán)重值升高，D的計算方法例如以下
假設(shè)某個樣本被正確分類。更新該樣本權(quán)重值為：

D(t+1)i=D(t)ie?αSum(D)

假設(shè)某個樣本被錯誤分類，更新該樣本的權(quán)重值為：

D(t+1)i=D(t)ieαSum(D)

計算出D后，AdaBoost接著開始下一輪的迭代。AdaBoost算法會不斷地反復(fù)訓(xùn)練和調(diào)整權(quán)重的過程，知道訓(xùn)練錯誤率為0或者若分類器的數(shù)目達到用戶指定值為止。

在建立完整的AdaBoost算法之前，須要通過一些代碼建立若分類器及保存數(shù)據(jù)集的權(quán)重。

算法描寫敘述：

3.基于單層決策樹構(gòu)建若分類器

單層決策樹是一種簡單的決策樹。首先構(gòu)建一個簡單的數(shù)據(jù)集,建立一個adaboost.py文件并增加下列代碼：

def loadSimpData():datMat = matrix([[ 1. , 2.1],[ 2. , 1.1],[ 1.3, 1. ],[ 1. , 1. ],[ 2. , 1. ]])classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]return datMat,classLabels

導(dǎo)入數(shù)據(jù)

>>> import adaboost >>> datMat,classLabels=adaboost.loadSimpData()

附：自適應(yīng)數(shù)據(jù)載入函數(shù)

def loadDataSet(fileName): #general function to parse tab -delimited floatsnumFeat = len(open(fileName).readline().split('\t')) #get number of fields dataMat = []; labelMat = []fr = open(fileName)for line in fr.readlines():lineArr =[]curLine = line.strip().split('\t')for i in range(numFeat-1):lineArr.append(float(curLine[i]))dataMat.append(lineArr)labelMat.append(float(curLine[-1]))return dataMat,labelMat

以下兩個函數(shù)，一個用于測試是否某個值小于或者大于我們正在測試的閾值，一個會在一個加權(quán)數(shù)據(jù)集中循環(huán)，并找到具有最低錯誤率的單層決策樹。

偽代碼例如以下：

將最小錯誤率minError設(shè)為無窮大對數(shù)據(jù)及中的每一個特征（第一層循環(huán)）：對每一個步長（第二層循環(huán)）：對每一個不等號（第三層循環(huán)）：建立一顆單層決策樹并利用加權(quán)數(shù)據(jù)集對它進行測試假設(shè)錯誤率低于minError。則將當(dāng)前單層決策樹設(shè)置為最佳單層決策樹返回最佳單層決策樹

單層決策樹生成函數(shù)代碼：

def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):#just classify the dataretArray = ones((shape(dataMatrix)[0],1))if threshIneq == 'lt':retArray[dataMatrix[:,dimen] <= threshVal] = -1.0else:retArray[dataMatrix[:,dimen] > threshVal] = -1.0return retArraydef buildStump(dataArr,classLabels,D):dataMatrix = mat(dataArr); labelMat = mat(classLabels).Tm,n = shape(dataMatrix)numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m,1)))minError = inf #init error sum, to +infinityfor i in range(n):#loop over all dimensionsrangeMin = dataMatrix[:,i].min(); rangeMax = dataMatrix[:,i].max();stepSize = (rangeMax-rangeMin)/numStepsfor j in range(-1,int(numSteps)+1):#loop over all range in current dimensionfor inequal in ['lt', 'gt']: #go over less than and greater thanthreshVal = (rangeMin + float(j) * stepSize)predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal)#call stump classify with i, j, lessThanerrArr = mat(ones((m,1)))errArr[predictedVals == labelMat] = 0weightedError = D.T*errArr #calc total error multiplied by D#print "split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError)if weightedError < minError:minError = weightedErrorbestClasEst = predictedVals.copy()bestStump['dim'] = ibestStump['thresh'] = threshValbestStump['ineq'] = inequalreturn bestStump,minError,bestClasEst

4.AdaBoost算法的實現(xiàn)

整個實現(xiàn)的偽代碼例如以下：

對每次迭代：利用buildStump()函數(shù)找到最佳的單層決策樹將最佳單層決策樹增加到單層決策樹數(shù)據(jù)中計算alpha計算心的權(quán)重向量D更新累計類別預(yù)計值假設(shè)錯誤率低于0.0 則退出循環(huán)

基于單層決策樹的AdaBoost訓(xùn)練過程

def adaBoostTrainDS(dataArr,classLabels,numIt=40):weakClassArr = []m = shape(dataArr)[0]D = mat(ones((m,1))/m) #init D to all equalaggClassEst = mat(zeros((m,1)))for i in range(numIt):bestStump,error,classEst = buildStump(dataArr,classLabels,D)#build Stump#print "D:",D.Talpha = float(0.5*log((1.0-error)/max(error,1e-16)))#calc alpha, throw in max(error,eps) to account for error=0bestStump['alpha'] = alpha weakClassArr.append(bestStump) #store Stump Params in Array#print "classEst: ",classEst.Texpon = multiply(-1*alpha*mat(classLabels).T,classEst) #exponent for D calc, getting messyD = multiply(D,exp(expon)) #Calc New D for next iterationD = D/D.sum()#calc training error of all classifiers, if this is 0 quit for loop early (use break)aggClassEst += alpha*classEst#print "aggClassEst: ",aggClassEst.TaggErrors = multiply(sign(aggClassEst) != mat(classLabels).T,ones((m,1)))errorRate = aggErrors.sum()/mprint "total error: ",errorRateif errorRate == 0.0: breakreturn weakClassArr,aggClassEst

5.測試算法

擁有了多個若分類器以及其相應(yīng)的alpha值，進行測試就方便了。

AdaBoost分類函數(shù):利用訓(xùn)練處的多個若分類器進行分類的函數(shù)。

def adaClassify(datToClass,classifierArr):dataMatrix = mat(datToClass)#do stuff similar to last aggClassEst in adaBoostTrainDSm = shape(dataMatrix)[0]aggClassEst = mat(zeros((m,1)))for i in range(len(classifierArr)):classEst = stumpClassify(dataMatrix,classifierArr[i]['dim'],\classifierArr[i]['thresh'],\classifierArr[i]['ineq'])#call stump classifyaggClassEst += classifierArr[i]['alpha']*classEstprint aggClassEstreturn sign(aggClassEst)

6.繪制ROC曲線

ROC曲線繪制代碼：

def plotROC(predStrengths, classLabels):import matplotlib.pyplot as pltcur = (1.0,1.0) #cursorySum = 0.0 #variable to calculate AUCnumPosClas = sum(array(classLabels)==1.0)yStep = 1/float(numPosClas); xStep = 1/float(len(classLabels)-numPosClas)sortedIndicies = predStrengths.argsort()#get sorted index, it's reversefig = plt.figure()fig.clf()ax = plt.subplot(111)#loop through all the values, drawing a line segment at each pointfor index in sortedIndicies.tolist()[0]:if classLabels[index] == 1.0:delX = 0; delY = yStep;else:delX = xStep; delY = 0;ySum += cur[1]#draw line from cur to (cur[0]-delX,cur[1]-delY)ax.plot([cur[0],cur[0]-delX],[cur[1],cur[1]-delY], c='b')cur = (cur[0]-delX,cur[1]-delY)ax.plot([0,1],[0,1],'b--')plt.xlabel('False positive rate'); plt.ylabel('True positive rate')plt.title('ROC curve for AdaBoost horse colic detection system')ax.axis([0,1,0,1])plt.show()print "the Area Under the Curve is: ",ySum*xStep

說明：文章中的代碼來自機器學(xué)習(xí)實戰(zhàn)。

References

【1】Machine Learning in Action 機器學(xué)習(xí)實戰(zhàn) 第七章

本欄目Machine Learning持續(xù)更新中。歡迎關(guān)注：Dream_Angel_Z博客

總結(jié)

以上是生活随笔為你收集整理的机器学习算法-Adaboost的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇： CentOS6.5安装Tab增强版:ba
下一篇：打造TypeScript的Visual