k-Nearest Neighbors and the Curse of Dimensionality

Machine Learning models and the curse of dimensionality
There is always a trade-off between things in life. If you take up a certain path, there is always a possibility that you will have to compromise on some other parameter. Machine Learning models are no different: in the case of k-Nearest Neighbors, there has always been a problem with a huge impact on classifiers that rely on pairwise distance, and that problem is nothing but the “Curse of Dimensionality”. By the end of this article you will be able to create your own k-Nearest Neighbor model and observe the impact of increasing the dimensionality of the data set you fit. Let’s dig in!
Creating a k-Nearest Neighbor model:
Right before we get our hands dirty with the technical part, we need to lay the foundation for our analysis, which is nothing but the libraries.
Thanks to the built-in machine learning packages, our job is quite easy.
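The exact library list from the original notebook isn't reproduced here; a typical setup for the sketches below, assuming NumPy, Matplotlib, and scikit-learn are available, would be:

```python
# Assumed setup: NumPy for arrays, Matplotlib for plots, scikit-learn for the k-NN models.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
```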
Nearest neighbors classifier:
Let’s begin with a simple nearest neighbor classifier, where we are posed with a binary classification task: we have a set of labeled inputs, where the labels are all either 0 or 1. Our goal is to train a classifier to predict a 0 or 1 label for new, unseen test data. One conceptually simple approach is to find the sample in the training data that is “most similar” to our test sample (a “neighbor” in the feature space), and then give the test sample the same label as that “most similar” training sample. This is the nearest neighbors classifier.
After running a few lines of code we can visualize our data set, with training data shown in blue (negative class) and red (positive class). A test sample is shown in green. To keep things simple, I have used a simple linear boundary for classification.
To find the nearest neighbor, we need a distance metric. For our case, I chose to use the L2 norm. There certainly are a few perks to using the L2 norm as a distance metric: given that we don’t have any outliers, the L2 norm minimizes the mean cost and treats every feature equally.
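With the L2 (Euclidean) norm, finding the nearest neighbor is just an argmin over the pairwise distances. A sketch, reusing X_train, y_train, and x_test from above:

```python
# L2 (Euclidean) distance from the test sample to every training sample
distances = np.linalg.norm(X_train - x_test, axis=1)

nearest_idx = np.argmin(distances)    # index of the closest training sample
prediction = y_train[nearest_idx]     # 1-NN prediction: copy that sample's label
print(f"nearest neighbor: {nearest_idx}, predicted label: {prediction}")
```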
The nearest neighbor to the test sample is circled, and its label is applied as the prediction for the test sample:
Nearest neighbor classified

Using the nearest neighbor, we successfully classified our test value as label “0”, but again we assumed there were no outliers, and we also kept the noise moderate.
The nearest neighbor classifier works by “memorizing” the training data. One interesting consequence of this is that it will have zero prediction error (or equivalently, 100% accuracy) on the training data, since each training sample’s nearest neighbor is itself:
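We can verify this with scikit-learn's KNeighborsClassifier: with n_neighbors=1, scoring the model on its own training data gives 100% accuracy. A sketch, reusing the toy data from above:

```python
knn1 = KNeighborsClassifier(n_neighbors=1)
knn1.fit(X_train, y_train)

# Each training sample's nearest neighbor is itself, so training accuracy is 1.0
print("training accuracy with k=1:", knn1.score(X_train, y_train))
```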
Now we look to overcome the shortcomings of the nearest neighbor model and the answer lies in the model named as the k-Nearest Neighbor classifier.
K nearest neighbors classifier:
To make this approach less sensitive to noise, we might choose to look for multiple similar training samples to each new test sample, and classify the new test sample using the mode of the labels of the similar training samples. This is k nearest neighbors, where k is the number of “neighbors” that we search for.
In the following plot, we show the same data as in the previous example. Now, however, the 3 closest neighbors to the test sample are circled, and the mode of their labels is used as the prediction for the new test sample. Feel free to play with the parameter k and observe the changes.
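A sketch of the same prediction with k=3, where the mode of the three neighbor labels becomes the prediction (again reusing the toy data above):

```python
knn3 = KNeighborsClassifier(n_neighbors=3)
knn3.fit(X_train, y_train)

print("k=3 prediction:", knn3.predict(x_test))

# The three neighbors that voted (indices into X_train) and their labels
_, idx = knn3.kneighbors(x_test, n_neighbors=3)
print("neighbor labels:", y_train[idx[0]])
```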
k-NN classifier with k=3

The following image shows a set of test points plotted on top of the training data. The size of each test point indicates the confidence in the label, which we approximate by the proportion of the k neighbors sharing that label.
Confidence score

The bigger the dot, the higher the confidence score for that point.
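The confidence approximation described above (the fraction of the k neighbors that share the predicted label) is exactly what KNeighborsClassifier.predict_proba returns, so a sketch of the confidence for each test point could be:

```python
proba = knn3.predict_proba(x_test)   # fraction of the k neighbors carrying each label
confidence = proba.max(axis=1)       # confidence in the predicted label
print("class probabilities:", proba, "confidence:", confidence)
```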
Also note that the training error for k nearest neighbors is not necessarily zero (though it can be!), since a training sample may have a different label than its k closest neighbors.
Feature scaling:
One important limitation of k nearest neighbors is that it does not “learn” anything about which features are most important for determining y. Every feature is weighted equally in finding the nearest neighbor.
The first implication of this is:
- If all features are equally important, but they are not all on the same scale, they must be normalized, i.e. rescaled onto the interval [0,1]. Otherwise, the features with the largest magnitudes will dominate the total distance (see the scaling sketch after this list).
The second implication is:
- Even if some features are more important than others, they will all be considered equally important in the distance calculation. If uninformative features are included, they may dominate the distance calculation.
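A minimal sketch of the normalization mentioned in the first point, using scikit-learn's MinMaxScaler to rescale every feature onto [0,1] (the two features and their ranges are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales, e.g. age in years and income in dollars
X_raw = np.array([[25.0, 40000.0],
                  [32.0, 120000.0],
                  [47.0, 65000.0]])

scaler = MinMaxScaler()                 # rescales each column onto [0, 1]
X_scaled = scaler.fit_transform(X_raw)
print(X_scaled)
```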
Contrast this with our logistic regression classifier. In the logistic regression, the training process involves learning coefficients. The coefficients weight each feature’s effect on the overall output.
Let’s see how our model performs for an image classification problem. Consider the following images from CIFAR10, a dataset of low-resolution images in ten classes:
Images classified as car

The images above show a test sample and two training samples with their distances to the test sample.
The background pixels in the test sample “count” just as much as the foreground pixels, so the image of the deer is considered a very close neighbor, while the image of the car is not. As stated before, we used the L2 norm, and our model considers every pixel to be equal, which makes it difficult for nearest neighbors to classify real-world images.
Images classified as car

We also see here that Euclidean distance is not a good metric of visual similarity: the frog on the right is almost as similar to the car as the deer in the middle!
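CIFAR-10 itself isn't loaded here, but the distance the article describes is just the L2 norm between flattened pixel arrays; a sketch with random stand-in images of CIFAR-10's 32x32x3 shape:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for a test image and two training images (32x32 RGB, like CIFAR-10)
test_img = rng.integers(0, 256, size=(32, 32, 3))
train_imgs = rng.integers(0, 256, size=(2, 32, 32, 3))

# Flatten and compare pixel by pixel: background pixels count exactly as much as foreground pixels
diffs = train_imgs.reshape(2, -1) - test_img.reshape(1, -1)
print(np.linalg.norm(diffs, axis=1))
```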
K nearest neighbors regression:
K nearest neighbors can also be used for regression, with just a small change: instead of using the mode of the nearest neighbors to predict the label of a new sample, we use the mean. Consider the following training data:
We can add a test sample, then use k nearest neighbors to predict its value:
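A sketch of the regression variant with scikit-learn's KNeighborsRegressor, which averages the target values of the k nearest neighbors (the 1D training data here is made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

# Noisy 1D regression data: y = sin(x) plus noise
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

reg = KNeighborsRegressor(n_neighbors=3)   # prediction = mean of the 3 nearest targets
reg.fit(X, y)
print("predicted value at x=2.5:", reg.predict([[2.5]]))
```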
The “curse of dimensionality”:
Classifiers that rely on pairwise distance between points, like the k neighbors methods, are heavily impacted by a problem known as the “curse of dimensionality”. In this section, I will illustrate the problem. We will look at a problem with data uniformly distributed in each dimension of the feature space, and two classes separated by a linear boundary.
We will generate a test point, and show the k nearest neighbors to the test point. We will also show the length (or area, or volume) that we had to search to find those k neighbors. We will observe the radius required to find the nearest neighbors as the dimensionality of the space increases.
Pay special attention to how that length (or area, or volume) changes as we increase the dimensionality of the feature space.
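Before walking through the 1D, 2D, and 3D pictures, here is a numerical sketch of the experiment: draw N uniform points in d dimensions and measure the distance to the k-th nearest neighbor of a random test point (the values of N and k are arbitrary choices):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
N, k = 100, 3

for d in (1, 2, 3, 10, 100):
    X = rng.uniform(0, 1, size=(N, d))            # training data, uniform in the unit hypercube
    test_point = rng.uniform(0, 1, size=(1, d))

    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors(test_point)

    # radius we had to search to capture the k nearest neighbors
    print(f"d={d:3d}: radius to {k} neighbors = {distances.max():.3f}")
```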
First, let's observe the 1D problem:
1D space radius search

Now, the 2D equivalent:
2D space radius search

Finally, the 3D equivalent:
3D space radius search

We can see that as the dimensionality of the problem grows, the higher-dimensional space is less densely occupied by the training data, and we need to search a large volume of space to find neighbors of the test point. The pair-wise distance between points grows as we add additional dimensions.
And in that case, the neighbors may be so far away that they don’t actually have much in common with the test point.
In general, the length of the smallest hyper-cube that contains all k-nearest neighbors of a test point is:
(k/N)^(1/d)

for N samples with dimensionality d.
From the expression above, we can see that as the number of dimensions increases linearly, the number of training samples must increase exponentially to counter the “curse”.
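To see the exponential growth implied by that expression, solve it for N: keeping the neighborhood length fixed at a fraction r of each axis requires N = k / r^d samples. A quick sketch (k=10 and r=0.1 are arbitrary choices):

```python
k, r = 10, 0.1   # want the 10 nearest neighbors within 10% of each axis

for d in (1, 2, 3, 5, 10):
    print(f"d={d:2d}: need roughly {k / r**d:.0f} samples")
# d=1 needs about 100 samples; d=10 needs about 10**11: exponential growth in d
```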
Alternatively, we can reduce d — either by feature selection or by transforming the data into a lower-dimensional space.
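One common way to transform the data into a lower-dimensional space is PCA; a sketch with scikit-learn (the number of components and the synthetic data are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X_high = rng.normal(size=(200, 50))      # 200 samples with 50 features

pca = PCA(n_components=5)                # keep only the top 5 principal components
X_low = pca.fit_transform(X_high)
print(X_high.shape, "->", X_low.shape)   # (200, 50) -> (200, 5)
```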
Source: https://towardsdatascience.com/k-nearest-neighbors-and-the-curse-of-dimensionality-7d64634015d9