Notes on "Machine Learning in Action" (02): k-Nearest Neighbors
The k-Nearest Neighbors algorithm
k-NN classifies an example by measuring the distances between its feature values and those of known examples.
- Pros: high accuracy, insensitive to outliers, no assumptions about the input data
- Cons: high computational complexity, high space complexity
- Applicable data: numeric values and nominal values
How the algorithm works
There is a sample dataset (the training set) in which every example carries a label, so we know which class each example in the set belongs to.
When a new, unlabeled example arrives, each of its features is compared with the corresponding features of the examples in the training set, and the algorithm extracts the class labels of the most similar (nearest) examples.
In general, only the k most similar examples in the training set are considered; this is where the k in k-NN comes from, and k is usually an integer no greater than 20.
Finally, the class that appears most often among those k most similar examples is assigned as the class of the new example.
The kNN classification algorithm
kNN.py
Pseudocode (for each unknown example):
1. Compute the distance between the unknown example and every example in the training set.
2. Sort the distances in increasing order.
3. Select the k examples with the smallest distances.
4. Count how often each class label appears among those k examples.
5. Return the most frequent class as the predicted class of the unknown example.
Step 1 uses the Euclidean distance formula, shown below.
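For two points xA and xB with two features each, the Euclidean distance is:

d = sqrt((xA_0 - xB_0)^2 + (xA_1 - xB_1)^2)

It generalizes to any number of features by summing the squared differences over all of them, which is exactly what classify0 below computes.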
```python
import operator
from numpy import *

def classify0(inX, dataSet, labels, k):
    # distance calculation
    dataSetSize = dataSet.shape[0]                    # number of rows in the numpy array
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet   # tile repeats inX dataSetSize times so it can be subtracted
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)               # axis=0 sums down columns, axis=1 sums across rows
    distances = sqDistances ** 0.5                    # square root
    sortedDistIndicies = distances.argsort()          # indices that would sort the distances ascending
    # vote among the k nearest points
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1   # increment the vote for this label
    # sort the (label, votes) pairs by vote count, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```
Example usage:
```python
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = createDataSet()
print(classify0([0, 0], group, labels, 3))   # prints: B
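To see why this returns 'B', work through the distances from [0, 0] to the four training points: sqrt(1.0^2 + 1.1^2) ≈ 1.49 (label A), sqrt(1.0^2 + 1.0^2) ≈ 1.41 (A), 0 (B), and 0.1 (B). The three nearest neighbors are the two B points and one A point, so the majority vote is B.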
Example 1: improving the match results of a dating site with k-NN
1. Prepare the data: parsing data from a text file
```python
def file2matrix(filename):
    with open(filename) as fr:
        lines = fr.readlines()
    numberOfLines = len(lines)               # number of lines in the file
    returnMat = zeros((numberOfLines, 3))    # matrix of features to return
    classLabelVector = []                    # list of labels to return
    index = 0
    for line in lines:
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
```
Run result:
```
>>> datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
>>> print(datingDataMat)
[[  4.09200000e+04   8.32697600e+00   9.53952000e-01]
 [  1.44880000e+04   7.15346900e+00   1.67390400e+00]
 [  2.60520000e+04   1.44187100e+00   8.05124000e-01]
 ...,
 [  2.65750000e+04   1.06501020e+01   8.66627000e-01]
 [  4.81110000e+04   9.13452800e+00   7.28045000e-01]
 [  4.37570000e+04   7.88260100e+00   1.33244600e+00]]
>>> print(datingLabels[0:20])
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
```
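As an aside, because datingTestSet2.txt is just four tab-separated numeric columns (the last being the class label, as the output above shows), the same parsing can be done with numpy.loadtxt. A minimal alternative sketch, not the book's code:

```python
# Alternative sketch to file2matrix using loadtxt (not from the book).
# Assumes four tab-separated numeric columns with the label in the last one.
from numpy import loadtxt

def file2matrix_np(filename):
    data = loadtxt(filename, delimiter='\t')
    return data[:, 0:3], data[:, -1].astype(int).tolist()
```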
2. Analyze the data: creating scatter plots with Matplotlib (requires pip install matplotlib)
```python
import matplotlib
import matplotlib.pyplot as plt
from numpy import array

fig = plt.figure()
ax = fig.add_subplot(111)   # 1 row, 1 column, plot 1
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])   # all rows, second and third columns
plt.show()

# The same plot with marker size and color driven by the class labels
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * array(datingLabels), 15.0 * array(datingLabels))
plt.show()
```
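Axis labels make the scatter plot easier to read. A small optional sketch of my own; the label text follows the prompts used later in classifyPerson:

```python
# Optional sketch: label the axes of the colored scatter plot.
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * array(datingLabels), 15.0 * array(datingLabels))
ax.set_xlabel('Percentage of time spent playing video games')
ax.set_ylabel('Liters of ice cream consumed per year')
plt.show()
```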
3. Prepare the data: normalizing numeric values
Sometimes the values of different features in the data differ by orders of magnitude, which distorts the distance calculation.
Remedy:
Normalize the values, for example by rescaling them to the range 0 to 1 or -1 to 1, with the formula:
newValue = (oldValue - min) / (max - min)
```python
def autoNorm(dataSet):
    minVals = dataSet.min(0)   # column-wise minimums
    maxVals = dataSet.max(0)   # column-wise maximums
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))   # zero matrix with the same shape as dataSet
    m = dataSet.shape[0]                  # number of examples in dataSet
    # tile(minVals, (m, 1)) stacks m copies of minVals so it can be subtracted
    # element-wise; e.g. tile(a, (2, 1)) copies a once along the x axis
    # (leaving it unchanged) and then repeats the result twice along the y axis
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))   # element-wise divide
    return normDataSet, ranges, minVals
```
Run result:
```
>>> normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
>>> normMat
[[ 0.44832535  0.39805139  0.56233353]
 [ 0.15873259  0.34195467  0.98724416]
 [ 0.28542943  0.06892523  0.47449629]
 ...,
 [ 0.29115949  0.50910294  0.51079493]
 [ 0.52711097  0.43665451  0.4290048 ]
 [ 0.47940793  0.3768091   0.78571804]]
>>> ranges
[  9.12730000e+04   2.09193490e+01   1.69436100e+00]
>>> minVals
[ 0.        0.        0.001156]
```
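As a side note, the tile calls are not strictly required: NumPy broadcasting will subtract and divide a length-3 array across the rows of an (m, 3) matrix automatically. A minimal equivalent sketch (my rewrite, not the book's code):

```python
# Equivalent of autoNorm using NumPy broadcasting instead of tile (my rewrite).
def autoNorm_broadcast(dataSet):
    minVals = dataSet.min(0)
    ranges = dataSet.max(0) - minVals
    return (dataSet - minVals) / ranges, ranges, minVals
```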
4. Test the algorithm: validating the classifier as a complete program
An important part of working with any machine-learning algorithm is evaluating its accuracy. A common practice is to use 90% of the available data to train the classifier and hold out the remaining 10% to test it and measure its error rate.
```python
def datingClassTest():
    hoRatio = 0.50   # hold out 50% of the data for testing
    # parse the data
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')   # load data set from file
    # normalize it
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)   # number of test examples
    errorCount = 0.0
    for i in range(numTestVecs):
        # core kNN call: the remaining rows serve as the training set
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        # count the errors
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print('errorCount: ' + str(errorCount))
```
Run result:
```
>>> datingClassTest()
...
the classifier came back with: 2, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the total error rate is: 0.064000
errorCount: 32.0
```
With hoRatio = 0.50 on the 1,000-row dataset, 500 vectors are tested, so 32 errors give an error rate of 32 / 500 = 0.064.
5. Use the algorithm: building a complete, usable system
```python
def classifyPerson(file):
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix(file)
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # normalize the new example with the same ranges before classifying it
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])

kNN.classifyPerson('..\\datingTestSet2.txt')
```
Result:
```
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person: in small doses
```
Example 2: a handwriting recognition system
1. Prepare the data: converting images into test vectors
- trainingDigits: 2,000 training samples
- testDigits: about 900 test samples
A sample digit:
```
00000000000001111000000000000000
00000000000011111110000000000000
00000000001111111111000000000000
00000001111111111111100000000000
00000001111111011111100000000000
00000011111110000011110000000000
00000011111110000000111000000000
00000011111110000000111100000000
00000011111110000000011100000000
00000011111110000000011100000000
00000011111100000000011110000000
00000011111100000000001110000000
00000011111100000000001110000000
00000001111110000000000111000000
00000001111110000000000111000000
00000001111110000000000111000000
00000001111110000000000111000000
00000011111110000000001111000000
00000011110110000000001111000000
00000011110000000000011110000000
00000001111000000000001111000000
00000001111000000000011111000000
00000001111000000000111110000000
00000001111000000001111100000000
00000000111000000111111000000000
00000000111100011111110000000000
00000000111111111111110000000000
00000000011111111111110000000000
00000000011111111111100000000000
00000000001111111110000000000000
00000000000111110000000000000000
00000000000011000000000000000000
```
Each 32 x 32 binary image matrix is converted into a 1 x 1024 vector.
```python
def img2vector(filename):
    returnVect = zeros((1, 1024))   # a 1 x 1024 vector of zeros
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32*i + j] = int(lineStr[j])
    return returnVect
```
Run result:
```
>>> testVector = kNN.img2vector('testDigits/0_13.txt')
>>> testVector[0, 0:31]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
>>> testVector[0, 32:63]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  1.
  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
```
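The same conversion can be written more compactly with NumPy. A sketch of an alternative (mine, not the book's); it assumes each file holds exactly 32 lines of 32 '0'/'1' characters, as in the digits dataset:

```python
# Compact alternative to img2vector (a sketch, not the book's code).
from numpy import array

def img2vector_np(filename):
    with open(filename) as fr:
        rows = [[int(c) for c in line.strip()] for line in fr]
    return array(rows, dtype=float).reshape(1, 1024)
```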
2. Test the algorithm: recognizing handwritten digits with k-NN
```python
import os

def handwritingClassTest():
    # prepare the training data
    hwLabels = []
    trainingFileList = os.listdir('trainingDigits')   # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]         # take off .txt
        classNumStr = int(fileStr.split('_')[0])    # filenames look like 9_45.txt
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    # iterate through the test set
    testFileList = os.listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]         # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
```
Run result:
```
>>> handwritingClassTest()
...
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9

the total number of errors is: 11

the total error rate is: 0.011628
```
When you actually run it, this algorithm is not very efficient: every test vector requires a distance computation against all 2,000 training vectors, each of them 1,024-dimensional.
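If speed matters, a library implementation is the usual escape hatch. Below is a minimal sketch using scikit-learn's KNeighborsClassifier; this is my addition, not the book's code, and testMat / testLabels are hypothetical arrays built from testDigits the same way trainingMat and hwLabels are built above:

```python
# Sketch only (not from the book): k-NN via scikit-learn. trainingMat and
# hwLabels come from handwritingClassTest; testMat and testLabels are
# hypothetical arrays stacking the test vectors and their true digits.
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(trainingMat, hwLabels)
print(clf.score(testMat, testLabels))   # fraction of digits classified correctly
```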
Summary
- kd-trees are an optimized variant of the k-NN algorithm and can save a great deal of computation.
- k-NN is the simplest and most effective algorithm for classifying data.
- k-NN is instance-based learning: to use the algorithm you must have training samples close to the actual data.
- k-NN must keep the entire dataset around, so a large training set demands a large amount of storage.
- Moreover, since a distance must be computed to every example in the dataset, it can be very slow in practice.
- Another drawback of k-NN is that it gives no information about the underlying structure of the data, so you cannot tell what an average or typical example of each class looks like.