近似算法的近似率_选择最佳近似最近算法的数据科学家指南
近似算法的近似率
by Braden Riggs and George Williams (gwilliams@gsitechnology.com)
Whether you are new to the field of data science or a seasoned veteran, you have likely come across the terms “nearest-neighbor search” or “similarity search”. In fact, if you have ever used a search engine, recommender, translation tool, or pretty much anything else on the internet, then you have probably made use of some form of nearest-neighbor algorithm. These algorithms, which permeate most modern software, solve a very simple yet incredibly common problem: given a data point, what is the closest match from a large selection of data points, or rather, which point is most like the given point? These are “nearest-neighbor” search problems, and one widely used solution is the Approximate Nearest Neighbor algorithm, or ANN algorithm for short.
Approximate nearest-neighbor algorithms, or ANNs, are a topic I have blogged about heavily, and with good reason. As we attempt to optimize and solve the nearest-neighbor challenge, ANNs continue to be at the forefront of elegant and efficient solutions to these problems. Introductory machine learning classes often include a segment about ANN’s older brother, kNN, a conceptually simpler style of nearest-neighbor algorithm that is less efficient but easier to understand. If you aren’t familiar with kNN algorithms, they essentially work by classifying unseen points based on “k” nearby points, where the vicinity or distance of the nearby points is calculated by a distance formula such as Euclidean distance.
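The kNN idea just described can be sketched in a few lines of plain Python. This is an illustrative brute-force version with a made-up toy dataset, not any particular library's implementation:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, points, labels, k=3):
    # Rank every stored point by distance to the query (brute force),
    # then vote among the labels of the k closest points.
    ranked = sorted(range(len(points)), key=lambda i: euclidean(query, points[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D dataset: two well-separated clusters.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify((8, 8), points, labels, k=3))  # → "b"
```

Note that the brute-force ranking touches every stored point on every query, which is exactly the cost that ANN algorithms are designed to avoid.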
ANNs work similarly but with a few more techniques and strategies that ensure greater efficiency. I go into more depth about these techniques in an earlier blog here. In that blog, I describe an ANN as:
A faster classifier with a slight trade-off in accuracy, utilizing techniques such as locality sensitive hashing to better balance speed and precision.
- Braden Riggs, How to Benchmark ANN Algorithms
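As a rough illustration of the locality-sensitive-hashing idea mentioned in the quote (a minimal random-hyperplane sketch, not any library's actual implementation), similar vectors tend to land in the same hash bucket, so a search only needs to scan candidates in the query's bucket:

```python
import random

def make_hyperplanes(dim, n_planes, seed=0):
    # Each hyperplane is a random normal vector; the sign of the dot
    # product with a data vector contributes one bit of the hash.
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def lsh_signature(vec, planes):
    # Concatenate one sign bit per hyperplane into a bucket key.
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return tuple(1 if dot(vec, p) >= 0 else 0 for p in planes)

planes = make_hyperplanes(dim=4, n_planes=8)
v = [0.9, 0.1, -0.3, 0.5]
w = [0.91, 0.11, -0.29, 0.52]  # nearly identical direction
# Near-duplicate vectors will usually (though not always) share a bucket.
print(lsh_signature(v, planes) == lsh_signature(w, planes))
```

Real LSH schemes use many such hash tables at once to trade memory for a higher chance of catching the true neighbors.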
The problem with utilizing the power of ANNs in your own projects is the sheer quantity of different implementations open to the public, each with its own benefits and disadvantages. With so many choices available, how can you pick the one that is right for your project?
Bernhardsson and ANN-Benchmarks to the Rescue
For this project, we need a little help from the experts. Photo by Tra Nguyen on Unsplash.
We have established that there is a range of ANN implementations available for use. However, we need a way of picking out the best of the best, the cream of the crop. This is where Aumüller, Bernhardsson, and Faithfull’s paper ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms and its corresponding GitHub repository come to our rescue.
The project, which I have discussed in the past, is a great starting point for choosing the algorithm that is the best fit for your project. The paper uses some clever techniques to evaluate the performance of a number of ANN implementations on a selection of datasets. It has these ANN algorithms answer nearest-neighbor queries to determine their accuracy and efficiency at different parameter combinations. Each algorithm uses these queries to locate the 10 nearest data points to the queried point, and the benchmark evaluates how close each returned point is to the true neighbor, a metric called Recall. This is then scaled against how quickly the algorithm was able to accomplish its goal, a metric called Queries per Second. Together, these metrics provide a great reference for determining which algorithms may be most preferable for you and your project.
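A simplified version of the Recall metric just described can be computed as the overlap between the approximate and the true top-10 result lists (the benchmark harness is somewhat more careful, e.g. about distance ties; the IDs below are made up for illustration):

```python
def recall_at_k(approx_ids, true_ids, k=10):
    # Fraction of the true k nearest neighbors that the ANN algorithm
    # actually returned (1.0 means a perfect result list).
    return len(set(approx_ids[:k]) & set(true_ids[:k])) / k

true_ids   = [3, 7, 1, 9, 4, 0, 8, 2, 6, 5]    # exact 10-NN (hypothetical IDs)
approx_ids = [3, 7, 1, 9, 4, 0, 8, 2, 11, 12]  # ANN result missing two of them
print(recall_at_k(approx_ids, true_ids))  # → 0.8
```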
Screenshot from an earlier blog where I recreated Bernhardsson’s results benchmarking ANN algorithms on the glove-25-angular NLP dataset. Read more here. Image by Author.
Part of conducting this experiment is picking the algorithms we want to test, and the dataset we want to perform the queries on. Based on the experiments I have conducted for my previous blogs, narrowing down the selection of algorithms wasn’t difficult. Bernhardsson’s original project includes 18 algorithms. Given the performance I had seen in my first blog, using the glove-25-angular natural language dataset, 9 algorithms were worth considering for our benchmark experiment; the rest perform so slowly and so poorly that they weren’t worth including. The algorithms selected are:
Annoy: Spotify's “Approximate Nearest Neighbors Oh Yeah” ANN implementation.
Faiss: The suite of algorithms Facebook uses for large dataset similarity search including Faiss-lsh, Faiss-hnsw, and Faiss-ivf.
Flann: Fast Library for ANN.
HNSWlib: Hierarchical Navigable Small World graph ANN search library.
NGT-panng: Yahoo Japan’s Neighborhood Graph and Tree for Indexing High-dimensional Data.
Pynndescent: Python implementation of Nearest Neighbor Descent for k-neighbor-graph construction and ANN search.
SW-graph(nmslib): Small world graph ANN search as part of the non-metric space library.
In addition to the algorithms, it was important to pick a dataset that would help distinguish the optimal ANN implementations from the not-so-optimal ones. For this task, we chose 1% — or a 10 million vector slice — of the gargantuan Deep-1-billion dataset, a 96-dimension computer vision training dataset. This dataset is large enough for inefficiencies in the algorithms to be accentuated, providing a relevant challenge for each one. Because of the size of the dataset and the limited specification of our hardware, namely its 64GB of memory, some algorithms were unable to fully run to an accuracy of 100%. To help account for this, and to ensure that background processes on our machine didn’t interfere with our results, each algorithm and all of its parameter combinations were run twice. By doubling the number of benchmarks conducted, we were able to average the two runs, helping account for any interruptions on our hardware.
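To see why memory was the binding constraint, a quick back-of-the-envelope estimate of the raw footprint of the 10-million-vector slice (assuming 32-bit float components, and ignoring each algorithm's index overhead, which can multiply this considerably) looks like:

```python
vectors = 10_000_000      # the 1% slice of Deep-1-billion
dims = 96                 # dimensionality of the dataset
bytes_per_float = 4       # float32 components

raw_bytes = vectors * dims * bytes_per_float
print(raw_bytes / 2**30)  # roughly 3.6 GiB for the raw vectors alone
```

Graph- and hash-based indexes layer their own structures on top of this, which is how several algorithms pushed past the 64GB available.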
This experiment took roughly 11 days to complete but yielded some helpful and insightful results.
What did we find?
After the exceptionally long runtime, the experiment completed with only three algorithms failing to fully reach an accuracy of 100%: Faiss-lsh, Flann, and NGT-panng. Despite these algorithms not reaching perfect accuracy, their results are still useful and indicate where each algorithm may have been heading had we experimented with more parameter combinations and not exhausted the memory on our hardware.
Before showing off the results, let’s quickly discuss how we are presenting them and what terminology you need to understand. On the y-axis, we have Queries per Second, or QPS. QPS quantifies the number of nearest-neighbor searches that can be conducted in a second. This is sometimes referred to as the inverse ‘latency’ of the algorithm. More precisely, QPS is a bandwidth measure and is inversely proportional to latency: as the query time goes down, the bandwidth goes up. On the x-axis, we have Recall. In this case, Recall essentially represents the accuracy of the algorithm. Because we are finding the 10 nearest-neighbors of a selected point, the Recall score takes the distances of the 10 nearest-neighbors our algorithms computed and compares them to the distances of the 10 true nearest-neighbors. If the algorithm selects the correct 10 points, its result will have a distance of zero from the true values and hence a Recall of 1. When using ANN algorithms we are constantly trying to maximize both of these metrics. However, they often improve at each other’s expense. When you speed up your algorithm, thereby improving latency, it becomes less accurate. On the other hand, when you prioritize its accuracy, thereby improving Recall, the algorithm slows down.
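Measured naively, the QPS axis comes from a simple timing loop like the sketch below (the numbers in this post come from the ann-benchmarks harness, not from hand-rolled code like this; the stand-in "algorithm" here is deliberately trivial):

```python
import time

def measure_qps(search_fn, queries):
    # Queries per second = number of queries / wall-clock time taken.
    start = time.perf_counter()
    for q in queries:
        search_fn(q)
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed

# Stand-in "algorithm": brute-force 1-D nearest neighbor over a list.
data = list(range(1000))
nearest = lambda q: min(data, key=lambda x: abs(x - q))
qps = measure_qps(nearest, queries=[7.3, 512.9, 41.0] * 50)
print(f"{qps:.0f} queries/second")
```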
Pictured below is the plot of Queries per Second against the Recall of each algorithm:
The effectiveness of each algorithm as evaluated by Queries per Second (scaled logarithmically) and Recall (accuracy). The further up and to the right an algorithm’s line is, the better that algorithm performed. Image by Author.
As is evident from the graph above, there were some clear winners and some clear losers. Focusing on the winners, we can see a few algorithms that really stand out, namely HNSWlib (yellow) and NGT-panng (red), both of which performed at high accuracy and high speed. Even though NGT never finished, the results do indicate it was performing exceptionally well prior to a memory-related failure.
So given these results, we now know which algorithms to pick for our next project, right?
Unfortunately, this graph doesn’t depict the full story when it comes to the efficiency and accuracy of these ANN implementations. Whilst HNSWlib and NGT-panng can perform quickly and accurately, that is only after they have been built. “Build time” refers to the length of time that is required for the algorithm to construct its index and begin querying neighbors. Depending on the implementation of the algorithm, build time can be a few minutes or a few hours. Graphed below is the average algorithm build time for our benchmark excluding Faiss-HNSW which took 1491 minutes to build (about 24 hours):
Average build time, in minutes, for each algorithm tested, excluding Faiss-HNSW which took 24 hours to build. Note how some of the algorithms that ran quickly took longer to build. Image by Author.
As we can see, the picture changes substantially when we account for the time spent “building” each algorithm’s index. This index is essentially a roadmap for the algorithm to follow on its journey to find the nearest-neighbor. It allows the algorithm to take shortcuts, accelerating the time taken to find a solution. Depending on the size of the dataset and how intricate and comprehensive this roadmap is, build time can be anywhere between a matter of seconds and a number of days. Although accuracy is always a top priority, depending on the circumstances it may be advantageous to choose between algorithms that build quickly and algorithms that run quickly:
Scenario #1: You have a dataset that updates regularly but isn’t queried often, such as a school’s student attendance record or a government’s record of birth certificates. In this case, you wouldn’t want an algorithm that builds slowly, because each time more data is added to the set, the algorithm must rebuild its index to maintain high accuracy. If your algorithm builds slowly, this could waste valuable time and energy. Algorithms such as Faiss-IVF are perfect here because they build fast and are still very accurate.
Scenario #2: You have a static dataset that doesn’t change often but is regularly queried, like a list of words in a dictionary. In this case, it is preferable to use an algorithm that is able to perform more queries per second, at the expense of build time. This is because we aren’t adding new data regularly and hence don’t need to rebuild the index regularly. Algorithms such as HNSWlib or NGT-panng are perfect for this because, once the build is completed, they are accurate and fast.
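Both scenarios boil down to a simple total-time comparison: rebuild cost plus query cost. With made-up numbers merely shaped like the trade-offs above (the build times and QPS figures here are illustrative, not measured), the better choice flips with query volume:

```python
def total_seconds(build_s, qps, n_queries, n_rebuilds=1):
    # Total cost = (re)build time plus time spent answering queries.
    return n_rebuilds * build_s + n_queries / qps

# Hypothetical profiles: one fast-building, one fast-querying algorithm.
fast_build = dict(build_s=60, qps=2_000)      # e.g. an IVF-style index
fast_query = dict(build_s=3_600, qps=20_000)  # e.g. a graph-style index

few  = 100_000      # Scenario #1 shape: few queries between rebuilds
many = 500_000_000  # Scenario #2 shape: static index, queried constantly

print(total_seconds(**fast_build, n_queries=few)  < total_seconds(**fast_query, n_queries=few))   # → True
print(total_seconds(**fast_build, n_queries=many) > total_seconds(**fast_query, n_queries=many))  # → True
```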
There is a third scenario worth mentioning. In my experiments attempting to benchmark ANN algorithms on larger and larger portions of the deep1b dataset, available memory started to become a major limiting factor. Hence, picking an algorithm that makes efficient use of memory can be a major advantage. In this case, I would highly recommend the Faiss suite of algorithms, which have been engineered to perform under some of the most memory-starved conditions.
Regardless of the scenario, we almost always want high accuracy. In our case, accuracy, or Recall, is evaluated based on the algorithm’s ability to correctly determine the 10 nearest-neighbors of a given point. Hence an algorithm’s performance could change if we considered its 100 nearest-neighbors or its single nearest-neighbor instead.
The Summary
What will you pick for your next project? Photo by Franck V. on Unsplash.
Based on our findings from this benchmark experiment, there are clear benefits to using some algorithms as opposed to others. The key to picking an optimal ANN algorithm is understanding what you want to prioritize about the algorithm and what engineering trade-offs you are comfortable with. I recommend you prioritize what fits your circumstances, be that speed (QPS), accuracy (Recall), or pre-processing (build time). It is worth noting that algorithms performing with less than 90% Recall aren’t worth discussing, as 90% is considered the minimum acceptable level of performance when conducting nearest-neighbor search; anything less is underperforming and likely not useful.
With that said, my recommendations are as follows:
For projects where speed is a priority, our results suggest that algorithms such as HNSWlib and NGT-panng perform accurately with a greater number of queries per second than alternative choices.
For projects where accuracy is a priority, our results suggest that algorithms such as Faiss-IVF and SW-graph prioritize higher Recall scores, whilst still performing quickly.
For projects where pre-processing is a priority, our results suggest that algorithms such as Faiss-IVF and Annoy exhibit exceptionally fast build times whilst still balancing accuracy and speed.
Considering the circumstances of our experiment, there are a variety of scenarios in which some algorithms may perform better than others. In our case, we tried to test under the most generic and common of circumstances. We used a large dataset with high, but not excessively high, dimensionality to help indicate how these algorithms may perform on sets with similar specifications. For some of these algorithms, more tweaking and experimentation may lead to marginal improvements in runtime and accuracy. However, given the scope of this project, it would have been excessive to attempt this with each algorithm.
If you are interested in learning more about Bernhardsson’s project I recommend reading some of my other blogs on the topic. If you are interested in looking at the full CSV file of results from this benchmark, it is available on my GitHub here.
Future Work
Whilst this is a good starting point for picking ANN algorithms, there are still a number of alternative conditions to consider. Going forward, I would like to explore how batch performance impacts our results and whether different algorithms perform better when batching is included. Additionally, I suspect that some algorithms will perform better when querying for different numbers of nearest-neighbors. In this project we chose 10 nearest-neighbors; however, our results could shift when querying for 100 neighbors or just the single nearest-neighbor.
Appendix
Computer specifications: 1U GPU server; 2× Intel Xeon® Gold 5115 (CD8067303535601); Kingston KSM26RD8/16HAI 16GB 2666MHz DDR4 ECC Reg CL19 DIMM 2Rx8 Hynix A IDT memory modules; 4× Intel S4610 960GB 2.5″ SSD (SSDSC2KG960G801).
Link to How to Benchmark ANN Algorithms: https://medium.com/gsi-technology/how-to-benchmark-ann-algorithms-a9f1cef6be08
Link to ANN Benchmarks: A Data Scientist’s Journey to Billion Scale Performance: https://medium.com/gsi-technology/ann-benchmarks-a-data-scientists-journey-to-billion-scale-performance-db191f043a27
Link to CSV file that includes benchmark results: https://github.com/Briggs599/Deep1b-benchmark-results
Sources
Aumüller, Martin, Erik Bernhardsson, and Alexander Faithfull. “ANN-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms.” International Conference on Similarity Search and Applications. Springer, Cham, 2017.
Deep billion-scale indexing. (n.d.). Retrieved July 21, 2020, from http://sites.skoltech.ru/compvision/noimi/
Liu, Ting, et al. “An investigation of practical approximate nearest neighbor algorithms.” Advances in neural information processing systems. 2005.
翻譯自: https://towardsdatascience.com/a-data-scientists-guide-to-picking-an-optimal-approximate-nearest-neighbor-algorithm-6f91d3055115