KNN-2实现
CIFAR-10的KNN實現(xiàn)
作業(yè)講解
The KNN implementation has two main steps:
- Training: the classifier simply memorizes all of the training data.
- Testing: each test sample is compared against every training sample by computing distances; we take the labels of the k nearest training samples and decide the prediction by majority vote (sketched below).
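As a minimal sketch of the voting step (the function name and signature here are illustrative, not from the assignment code; the full classifier appears later):

```python
import numpy as np
from collections import Counter

def knn_vote(dists_row, y_train, k):
    """Predict one label from a row of test-to-train distances by majority vote."""
    nearest = np.argsort(dists_row)[:k]   # indices of the k closest training samples
    votes = Counter(y_train[nearest])     # count each label among the neighbors
    return votes.most_common(1)[0][0]     # the most frequent label wins
```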
Code implementation
```python
# Import packages
import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

# Show figures inline in the notebook instead of opening a new window
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)  # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
```

Download the dataset locally
```python
# Download the dataset with torchvision
# (the original snippet omitted these imports and the transform definition;
# a plain ToTensor transform is assumed here)
from torchvision import datasets, transforms

transform = transforms.ToTensor()

# Training set
trainset = datasets.CIFAR10(root='./CIFAR10', train=True, download=True, transform=transform)
# Test set
testset = datasets.CIFAR10(root='./CIFAR10', train=False, download=True, transform=transform)
```

Load the dataset
Load the dataset and split it into training and test sets.
```python
# Load the CIFAR-10 data (downloaded to a local directory beforehand)
cifar10_dir = './cs231n/datasets/cifar-10-batches-py'  # local path

# Clean up variables to prevent loading the data multiple times
# (which could cause memory issues)
try:
    del X_train, y_train
    del X_test, y_test
    print('Clear previously loaded data.')
except:
    pass

X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# Print the sizes of the training and test data as a sanity check
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
```

Output: the training set contains 50,000 images and the test set contains 10,000 images.
Sample images
Show a few samples from the training set.
The dataset contains 10 classes in total.
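The visualization cell itself did not survive above; here is a sketch along the lines of the standard CS231n notebook cell (the class-name list and samples_per_class = 7 are taken from that notebook, not from the text):

```python
# Show a few random samples from each of the 10 classes
classes = ['plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)                      # indices of this class
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1                    # position in the grid
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()
```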
Output: a grid of sample images from each class (figure not reproduced here).
In this exercise we subsample the data so the code runs faster; a sketch of that step follows.
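The subsampling cell is not shown above; here is a sketch consistent with the shapes used later (5000 training and 500 test samples, each image flattened into a row; the variable names are assumptions):

```python
# Subsample the data: keep 5000 training and 500 test images
# (these counts match the (500, 5000) distance matrix printed later)
num_training = 5000
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]

# Reshape each image into a row vector of 32*32*3 = 3072 pixels
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)   # (5000, 3072) (500, 3072)
```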
Create the KNN classifier
```python
import numpy as np
from collections import Counter

class KNearestNeighbor(object):
    """ a kNN classifier with L2 distance """

    def __init__(self):
        pass

    def train(self, X, y):
        self.X_train = X
        self.y_train = y

    def predict(self, X, k=1, num_loops=0):
        if num_loops == 0:
            dists = self.compute_distances_no_loops(X)
        elif num_loops == 1:
            dists = self.compute_distances_one_loop(X)
        elif num_loops == 2:
            dists = self.compute_distances_two_loops(X)
        else:
            raise ValueError('Invalid value %d for num_loops' % num_loops)
        return self.predict_labels(dists, k=k)

    def compute_distances_two_loops(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):        # loop over test samples
            for j in range(num_train):   # loop over training samples
                # equivalent: dists[i, j] = np.sqrt(np.sum(np.square(self.X_train[j, :] - X[i, :])))
                dists[i, j] = np.linalg.norm(X[i] - self.X_train[j])
        return dists

    def compute_distances_one_loop(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            # equivalent: dists[i, :] = np.sqrt(np.sum(np.square(self.X_train - X[i, :]), axis=1))
            dists[i, :] = np.linalg.norm(X[i, :] - self.X_train, axis=1)
        return dists

    def compute_distances_no_loops(self, X):
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        # Expand ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y and broadcast
        dists += np.sum(np.square(X), axis=1).reshape(num_test, 1)
        dists += np.sum(np.square(self.X_train), axis=1).reshape(1, num_train)
        dists -= 2 * np.dot(X, self.X_train.T)
        dists = np.sqrt(dists)
        return dists

    def predict_labels(self, dists, k=1):
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            # labels of the k nearest training samples
            closest_y = self.y_train[np.argsort(dists[i, :])[:k]].flatten()
            # majority vote over those labels
            y_pred[i] = Counter(closest_y).most_common(1)[0][0]
            # equivalent: y_pred[i] = np.argmax(np.bincount(closest_y))
        return y_pred
```
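The fully vectorized version works by expanding the squared L2 distance; for a test row $x$ and a training row $y$:

$$\lVert x - y \rVert_2 = \sqrt{\lVert x \rVert^2 + \lVert y \rVert^2 - 2\,x \cdot y}$$

so the entire distance matrix can be assembled from two vectors of squared norms plus a single matrix product $X X_{\text{train}}^\top$, with no Python loops.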
KNN training

```python
classifier = KNearestNeighbor()      # create the classifier
classifier.train(X_train, y_train)   # "train" on the training set
```

Training a KNN classifier means that each sample to be predicted is compared against the training set to find the closest training samples, so all train() does is bind the training set X_train (and its labels) to the classifier's self.X_train attribute.
```python
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)
# dists holds the distances between the test set and the training data:
# dists[i, j] is the distance between the i-th test sample and the j-th training sample.
# out: (500, 5000)
plt.imshow(dists, interpolation='none')
plt.show()
```
With dists computed, we can predict the class of each test image. Row i of dists holds the distances between the i-th test sample and all 5000 training images; we find the K training images with the smallest distances, and their labels give our prediction y_test_pred. With k=1, classifier.predict_labels finds only the single closest image, and its class becomes the predicted class for that test sample.
Prediction
k = 1
```python
# k = 1
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute the prediction accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
# out: Got 137 / 500 correct => accuracy: 0.274000
```

k = 5
```python
y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
# out: Got 145 / 500 correct => accuracy: 0.290000
```

k = 10
```python
y_test_pred = classifier.predict_labels(dists, k=10)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
# out: Got 144 / 500 correct => accuracy: 0.288000
```

Timing
First check that the one-loop implementation produces the same L2 distance matrix as the two-loop version.
```python
# Check that the two distance matrices, dists and dists_one, agree
dists_one = classifier.compute_distances_one_loop(X_test)
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('One loop difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')
'''out:
One loop difference was: 0.000000
Good! The distance matrices are the same
'''
```

Now compare how long each implementation takes.
```python
# Let's compare how fast the implementations are
def time_function(f, *args):
    """Call a function f with args and return the time (in seconds) that it took to execute."""
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)
'''out:
Two loop version took 24.126041 seconds
One loop version took 51.611078 seconds
No loop version took 0.319181 seconds
'''
```

Choosing K
Use cross-validation to choose the value of K.
Hyperparameters
In designing a machine-learning algorithm, choices such as the value of k in KNN, or the distance formula used to compare pixels, are hyperparameters, and tuning them is an indispensable step in machine learning.
Note: tune hyperparameters on a validation set, never the test set. The test set may only be used once, at the very end, to report the final result; using it earlier amounts to overfitting to the test set.
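As an illustration, a simple held-out split might look like this (a hypothetical sketch; the assignment itself uses cross-validation instead, as shown below):

```python
# Hold out the last 1000 training samples as a validation set
# (an illustrative split; sizes and names are assumptions)
num_val = 1000
X_val, y_val = X_train[-num_val:], y_train[-num_val:]
X_tr, y_tr = X_train[:-num_val], y_train[:-num_val]
# Tune k on (X_val, y_val); touch the test set only once, at the very end.
```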
Cross-validation
When the training set is small, you can split it evenly into several folds, take each fold in turn as the validation set, and average the results at the end. The training set is split as follows:
```python
num_folds = 5   # 5-fold cross-validation
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)

k_to_accuracies = {}

for k in k_choices:
    k_to_accuracies[k] = np.zeros(num_folds)
    for i in range(num_folds):
        # Use fold i as the validation set and the remaining folds for training
        Xtr = np.array(X_train_folds[:i] + X_train_folds[i+1:])
        ytr = np.array(y_train_folds[:i] + y_train_folds[i+1:])
        Xte = np.array(X_train_folds[i])
        yte = np.array(y_train_folds[i])

        Xtr = np.reshape(Xtr, (X_train.shape[0] * 4 // 5, -1))
        ytr = np.reshape(ytr, (y_train.shape[0] * 4 // 5, -1))
        Xte = np.reshape(Xte, (X_train.shape[0] // 5, -1))
        yte = np.reshape(yte, (y_train.shape[0] // 5, -1))

        classifier.train(Xtr, ytr)
        yte_pred = classifier.predict(Xte, k)
        yte_pred = np.reshape(yte_pred, (yte_pred.shape[0], -1))
        num_correct = np.sum(yte_pred == yte)
        accuracy = float(num_correct) / len(yte)
        k_to_accuracies[k][i] = accuracy

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
'''out:
k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.257000
k = 3, accuracy = 0.263000
k = 3, accuracy = 0.273000
k = 3, accuracy = 0.282000
k = 3, accuracy = 0.270000
k = 5, accuracy = 0.265000
k = 5, accuracy = 0.275000
k = 5, accuracy = 0.295000
k = 5, accuracy = 0.298000
k = 5, accuracy = 0.284000
k = 8, accuracy = 0.272000
k = 8, accuracy = 0.295000
k = 8, accuracy = 0.284000
k = 8, accuracy = 0.298000
k = 8, accuracy = 0.290000
k = 10, accuracy = 0.272000
k = 10, accuracy = 0.303000
k = 10, accuracy = 0.289000
k = 10, accuracy = 0.292000
k = 10, accuracy = 0.285000
k = 12, accuracy = 0.271000
k = 12, accuracy = 0.305000
k = 12, accuracy = 0.285000
k = 12, accuracy = 0.289000
k = 12, accuracy = 0.281000
k = 15, accuracy = 0.260000
k = 15, accuracy = 0.302000
k = 15, accuracy = 0.292000
k = 15, accuracy = 0.292000
k = 15, accuracy = 0.285000
k = 20, accuracy = 0.268000
k = 20, accuracy = 0.293000
k = 20, accuracy = 0.291000
k = 20, accuracy = 0.287000
k = 20, accuracy = 0.286000
k = 50, accuracy = 0.273000
k = 50, accuracy = 0.291000
k = 50, accuracy = 0.274000
k = 50, accuracy = 0.267000
k = 50, accuracy = 0.273000
k = 100, accuracy = 0.261000
k = 100, accuracy = 0.272000
k = 100, accuracy = 0.267000
k = 100, accuracy = 0.260000
k = 100, accuracy = 0.267000
'''
```

Plot the accuracy for each value of k under 5-fold cross-validation.
```python
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

accuracies_mean = np.array([np.mean(v) for k, v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k, v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()
```
Based on the cross-validation results above, choose the best value of k, retrain the classifier on all of the training data, and evaluate it on the test data. You should be able to get above 28% accuracy on the test set. A sketch of this final step follows.
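A minimal sketch, assuming best_k = 10 (the value with the highest mean accuracy in the run above; pick whatever your own cross-validation favors):

```python
# Retrain on the full (subsampled) training set using the best k found above
best_k = 10   # assumed from the mean cross-validation accuracies; adjust to your results

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the test accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
```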
Summary

KNN requires no real training (it just memorizes the data) but pays for it at test time, since every prediction scans the whole training set. With raw pixels and L2 distance it reaches roughly 28-29% accuracy on CIFAR-10, well above the 10% of random guessing. Vectorizing the distance computation makes an enormous speed difference, and k should be chosen by cross-validation, never on the test set.