Algorithm introduction
kNN (k nearest neighbors) is one of the simplest ML algorithms, often taught as one of the first algorithms in introductory courses. It's relatively simple but quite powerful, although rarely is any time spent on understanding its computational complexity and practical issues. It can be used for both classification and regression with the same complexity, so for simplicity we'll consider the kNN classifier.
kNN is an associative algorithm: during prediction it searches for the nearest neighbors and takes their majority vote as the class predicted for the sample. A training phase may or may not exist at all, since in general we have 2 possibilities:
1. No training at all: we simply store the training data and do all the work during prediction (the brute force approach).
2. Build a data structure (a k-d tree or a ball tree) during training, so that the nearest neighbor search during prediction is faster.
We focus on the methods implemented in Scikit-learn, the most popular ML library for Python. It supports the brute force approach as well as the k-d tree and ball tree data structures. These are relatively simple, efficient and perfectly suited for the kNN algorithm. Construction of these trees stems from computational geometry, not from machine learning, and does not concern us that much, so I'll cover it in less detail, more on the conceptual level. For more details on that, see the links at the end of the article.
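As a quick illustration (a minimal sketch; the toy data and parameter values below are placeholders, not taken from the article), this is how the search structure is selected in Scikit-learn through the algorithm parameter of KNeighborsClassifier:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: 1,000 points in 5 dimensions with 3 classes (placeholder values).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 3, size=1000)

# algorithm can be 'brute', 'kd_tree', 'ball_tree' or 'auto' (the default),
# which lets Scikit-learn choose a structure based on the data.
clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
clf.fit(X, y)
print(clf.predict(X[:5]))
```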
In all the complexities below, the time needed to calculate a distance is omitted, since in most cases it is negligible compared to the rest of the algorithm. Additionally, we denote:
n: number of points in the training dataset
d: data dimensionality
k: number of neighbors that we consider for voting
Brute force method
Training time complexity: O(1)
Training space complexity: O(1)
Prediction time complexity: O(k * n)
Prediction space complexity: O(1)
The training phase technically does not exist, since all computation is done during prediction, so we have O(1) for both time and space.
The prediction phase is, as the method name suggests, a simple exhaustive search, which in pseudocode is:
Loop k times:
    1. Compute the distance between the currently classified sample and every training point; remember the index of the point with the smallest distance (ignoring previously selected points).
    2. Add the class at the found index to the vote counter.
Return the class with the most votes as the prediction.
This is a nested loop structure: the outer loop runs k times and the inner search over the training points takes n steps. Updating the vote counter is O(1) and the final majority vote is O(# of classes), so both are dominated by the loops. Therefore, we have O(n * k) time complexity.
As for space complexity, we need a small vector to count the votes for each class. It's almost always very small and of fixed size, so we can treat it as O(1) space complexity.
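The pseudocode translates almost directly into code. Below is a minimal NumPy sketch of the idea; the function name, the use of Euclidean distance and the dictionary-based vote counter are my own choices, not part of the original text:

```python
import numpy as np

def knn_predict_brute(X_train, y_train, x, k):
    """Predict the class of a single sample x with brute force kNN."""
    # Distance from x to every training point (computed once up front).
    dists = np.linalg.norm(X_train - x, axis=1)
    votes = {}
    for _ in range(k):                    # outer loop: k iterations
        idx = int(np.argmin(dists))       # inner scan over all n points
        votes[y_train[idx]] = votes.get(y_train[idx], 0) + 1
        dists[idx] = np.inf               # ignore this point from now on
    # Return the class with the most votes.
    return max(votes, key=votes.get)
```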
k-d tree method
Training time complexity: O(d * n * log(n))
Training space complexity: O(d * n)
Prediction time complexity: O(k * log(n))
Prediction space complexity: O(1)
During the training phase, we have to construct the k-d tree. This data structure splits the k-dimensional space (here k means the number of dimensions of the space, don't confuse it with k as the number of nearest neighbors!) and allows faster search for the nearest points, since we "know where to look" in that space. You may think of it as a generalization of a BST to many dimensions. It "cuts" the space with axis-aligned cuts, dividing the points into groups stored in the children nodes.
Constructing the k-d tree is not a machine learning task itself, since it stems from the computational geometry domain, so we won't cover it in detail, only on the conceptual level. The time complexity is usually O(d * n * log(n)), because insertion is O(log(n)) (similar to a regular BST) and we have n points from the training dataset, each with d dimensions. I assume an efficient implementation of the data structure, i.e. one that finds the optimal split point (the median in the chosen dimension) in O(n), which is possible with the median of medians algorithm. The space complexity is O(d * n): note that it depends on the dimensionality d, which makes sense, since each of the n points stored in the tree has d coordinates (and larger d also means larger time complexity, for the same reason).
As for the prediction phase, the k-d tree structure naturally supports the "k nearest neighbors" query operation, which is exactly what we need for kNN. The simple approach is to just query k times, removing the found point each time; since a single query takes O(log(n)), it is O(k * log(n)) in total. But since the k-d tree already cut the space during construction, after a single query we approximately know where to look: we can just search the "surroundings" of that point. Therefore, practical implementations of the k-d tree support querying for all k neighbors at once, with complexity O(sqrt(n) + k), which is much better for the larger dimensionalities that are very common in machine learning.
The above complexities are the average ones, assuming a balanced k-d tree. The O(log(n)) times assumed above may degrade up to O(n) for unbalanced trees, but if the median is used during tree construction, we should always get a tree with approximately O(log(n)) insertion/deletion/search complexity.
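For reference, here is a minimal sketch of this workflow with Scikit-learn's KDTree: build the tree once during training, then ask for all k neighbors in a single query call (the data shapes and values are placeholders):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))   # n = 10,000 points, d = 3

tree = KDTree(X_train)                   # training phase: build the tree once
x_query = rng.normal(size=(1, 3))

# One call returns the k nearest neighbors (distances and indices).
dist, ind = tree.query(x_query, k=5)
print(ind[0], dist[0])
```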
Ball tree method
Training time complexity: O(d * n * log(n))
Training space complexity: O(d * n)
Prediction time complexity: O(k * log(n))
Prediction space complexity: O(1)
The ball tree algorithm takes another approach to dividing the space where the training points lie. In contrast to k-d trees, which divide space with median-value "cuts", the ball tree groups points into "balls" organized into a tree structure. They go from the largest (the root, with all points) to the smallest (the leaves, with only a few or even a single point). This allows fast nearest neighbor lookup, because near neighbors are in the same, or at least nearby, "balls".
During the training phase, we only need to construct the ball tree. There are a few algorithms for constructing it, but the one most similar to the k-d tree (called the "k-d construction algorithm" for that reason) is O(d * n * log(n)), the same as for the k-d tree.
Because of the similarity in tree construction, the complexities of the prediction phase are also the same as for the k-d tree.
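Scikit-learn exposes the ball tree through the same interface as the k-d tree, so the sketch looks almost identical (again, placeholder data):

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))

tree = BallTree(X_train)                 # build once, same query API as KDTree
dist, ind = tree.query(rng.normal(size=(1, 3)), k=5)
print(ind[0], dist[0])
```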
Choosing the method in practice
To summarize the complexities: brute force is the slowest in big O terms, while the k-d tree and the ball tree share the same, lower complexity. How do we know which one to use then?
To get the answer, we have to look at both training and prediction times, which is why I have provided both. The brute force algorithm has only one complexity, for prediction: O(k * n). The other algorithms need to create the data structure first, so for training and prediction together they get O(d * n * log(n) + k * log(n)), not taking into account the space complexity, which may also be important. Therefore, where the trees have to be constructed frequently, the training phase may outweigh their advantage of faster nearest neighbor lookup.
Should we use a k-d tree or a ball tree? It depends on the structure of the data: relatively uniform or "well behaved" data will make better use of the k-d tree, since the cuts of space will work well (after all the cuts, nearby points end up close together in the leaves). For more clustered data, the "balls" of the ball tree reflect the structure better and therefore allow for faster nearest neighbor search. Fortunately, Scikit-learn supports the "auto" option, which automatically infers the best data structure from the data.
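A simple way to see which option pays off on a particular dataset is to time the fit and predict phases for each algorithm setting; a rough sketch (the dataset sizes and shapes are arbitrary, and the measured times will of course vary by machine):

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 20))
y = rng.integers(0, 10, size=50_000)
X_test = rng.normal(size=(1_000, 20))

for algorithm in ("brute", "kd_tree", "ball_tree", "auto"):
    clf = KNeighborsClassifier(n_neighbors=3, algorithm=algorithm)
    t0 = time.perf_counter(); clf.fit(X, y); t1 = time.perf_counter()
    clf.predict(X_test); t2 = time.perf_counter()
    print(f"{algorithm:10s} fit: {t1 - t0:.3f}s  predict: {t2 - t1:.3f}s")
```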
Let's see this in practice on two case studies, which I've encountered during my studies and work.
Case study 1: classification
The more "traditional" application of kNN is the classification of data. It often involves quite a lot of points, e.g. MNIST has 60k training images and 10k test images. Classification is done offline, which means we first do the training phase and then just use its results during prediction. Therefore, if we want to construct the data structure, we only need to do so once. For each of the 10k test images, let's compare the brute force approach (which calculates all distances every time) and the k-d tree, for 3 neighbors and n = 60,000 training points:
Brute force (O(k * n)): 3 * 60,000 = 180,000
k-d tree (O(k * log(n))): 3 * log2(60,000) ≈ 3 * 16 = 48
Comparison: 48 / 180,000 ≈ 0.00027
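The back-of-envelope numbers above can be reproduced in a few lines (a base-2 logarithm is assumed here, as in the estimate above):

```python
import math

n, k = 60_000, 3                        # MNIST training set size, 3 neighbors
brute = k * n                           # O(k * n): distance scans per query
kd = round(k * math.log2(n))            # O(k * log(n)): tree lookup steps
print(brute, kd, f"{kd / brute:.5f}")   # 180000 48 0.00027
```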
As you can see, the performance gain is huge! The data structure method uses only a tiny fraction of the brute force time. For most datasets this method is a clear winner.
Case study 2: real-time smart monitoring
Machine Learning is commonly used for image recognition, often with neural networks. It's very useful for real-time applications, where it's often integrated with cameras, alarms etc. The problem with neural networks is that they often detect the same object 2 or more times; even the best architectures, like YOLO, have this problem. We can actually solve it with nearest neighbor search, using a simple approach:
1. Compute the center of each predicted bounding box.
2. For each center, search for the closest center of another bounding box.
3. If the two centers are closer than some chosen threshold, treat the boxes as duplicate detections of the same object and merge them.
The crucial part is searching for the closest center of another bounding box (point 2). Which algorithm should be used here? Typically we have only a few moving objects on camera, maybe up to 30–40. For such a small number, the speedup from using data structures for faster lookup is negligible. Each frame is a separate image, so if we wanted to construct a k-d tree, for example, we would have to do so for every frame, which may mean 30 times per second: a huge cost overall. Therefore, for such a situation a simple brute force method works fastest and also has the smallest space requirement (which, with heavy neural networks or with embedded CPUs in cameras, may be important).
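As an illustration of the idea (a rough sketch, not the original author's implementation: the (x1, y1, x2, y2) box format, the "keep the first box" merging rule and the 20-pixel threshold are all assumptions), duplicate detections can be suppressed with a plain brute force pass over the box centers:

```python
import numpy as np

def deduplicate_boxes(boxes, threshold=20.0):
    """boxes: array of shape (m, 4) with rows (x1, y1, x2, y2).
    Returns the boxes whose centers are not too close to an already kept box."""
    centers = np.column_stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                               (boxes[:, 1] + boxes[:, 3]) / 2])
    keep = []
    for i, c in enumerate(centers):
        if keep:
            # Brute force nearest center among boxes already kept: m is tiny,
            # so this O(m) scan per box beats building any tree per frame.
            d = np.linalg.norm(centers[keep] - c, axis=1)
            if d.min() < threshold:       # too close: treat as a duplicate
                continue
        keep.append(i)
    return boxes[keep]
```

For a few dozen boxes per frame this loop finishes in microseconds and needs no extra data structure, which is exactly the point made above.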
Summary
The kNN algorithm is a popular, easy and useful technique in Machine Learning, and I hope that after reading this article you understand its complexities and the real world scenarios of where and how you can use this method.
Translated from: https://towardsdatascience.com/k-nearest-neighbors-computational-complexity-502d2c440d5