Approximate Nearest Neighbors (approximate nearest neighbor search)
(I): Approximate nearest neighbors: http://en.wikipedia.org/wiki/Nearest_neighbor_search
Minor modifications have been made; if anything is unclear, please refer to the linked original.
1. Survey:
Nearest neighbor search (NNS), also known as proximity search, similarity search or closest point search, is an optimization problem for finding the closest (or most similar) points. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values. Formally, the nearest-neighbor (NN) search problem is defined as follows: given a set S of points in a space M and a query point q ∈ M, find the closest point in S to q. Donald Knuth in vol. 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office. A direct generalization of this problem is k-NN search, where we need to find the k closest points. In short, it is the optimization problem of finding the nearest neighbors of a point in a high-dimensional space.
Most commonly M is a metric space and dissimilarity is expressed as a distance metric, which is symmetric and satisfies the triangle inequality. Even more commonly, M is taken to be the d-dimensional vector space where dissimilarity is measured using the Euclidean distance, Manhattan distance or another distance metric. However, the dissimilarity function can be arbitrary. One example is the asymmetric Bregman divergences, for which the triangle inequality does not hold.[1]
This is essentially the question of which distance metric to use.
2. Methods (main difficulty encountered: the curse of dimensionality):
Linear search (suitable for small-scale distance computations)
The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(Nd) where N is the cardinality of S and d is the dimensionality of M. There are no search data structures to maintain, so linear search has no space complexity beyond the storage of the database. Naive search can, on average, outperform space partitioning approaches on higher dimensional spaces.[2]
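As a concrete illustration of the naive approach, here is a minimal linear-scan sketch in Python; the dataset `points`, its size and the use of Euclidean distance are assumptions for illustration only:

```python
import numpy as np

def linear_search(points, query):
    """Scan every point, keeping the best candidate so far: O(N*d) time."""
    best_idx, best_dist = -1, float("inf")
    for i, p in enumerate(points):
        d = np.linalg.norm(p - query)  # Euclidean distance to the query
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist

# Hypothetical usage: 1000 random points in 16 dimensions
points = np.random.rand(1000, 16)
query = np.random.rand(16)
print(linear_search(points, query))
```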
Space partitioning (tree-based indexing structures)
Since the 1970s, branch and bound methodology has been applied to the problem. In the case of Euclidean space this approach is known as spatial index or spatial access methods. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant-dimension query time, the average complexity is O(log N)[3] in the case of randomly distributed points; worst-case complexity analyses have also been performed.[4] Alternatively, the R-tree data structure was designed to support nearest neighbor search in a dynamic context, as it has efficient algorithms for insertions and deletions.
In the case of a general metric space, the branch-and-bound approach is known under the name of metric trees. Particular examples include the vp-tree and the BK-tree.
Using a set of points taken from a 3-dimensional space and put into a BSP tree, and given a query point taken from the same space, a possible solution to the problem of finding the nearest point-cloud point to the query point is given in the following description of an algorithm. (Strictly speaking, no such point may exist, because it may not be unique. But in practice, usually we only care about finding any one of the subset of all point-cloud points that exist at the shortest distance to a given query point.) The idea is, for each branching of the tree, guess that the closest point in the cloud resides in the half-space containing the query point. This may not be the case, but it is a good heuristic. After having recursively gone through all the trouble of solving the problem for the guessed half-space, now compare the distance returned by this result with the shortest distance from the query point to the partitioning plane. This latter distance is that between the query point and the closest possible point that could exist in the half-space not searched. If this distance is greater than that returned in the earlier result, then clearly there is no need to search the other half-space. If there is such a need, then you must go through the trouble of solving the problem for the other half space, and then compare its result to the former result, and then return the proper result. The performance of this algorithm is nearer to logarithmic time than linear time when the query point is near the cloud, because as the distance between the query point and the closest point-cloud point nears zero, the algorithm needs only perform a look-up using the query point as a key to get the correct result.
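The traversal just described can be sketched as a small recursive k-d tree search. This is a minimal sketch under assumed names and a median-split tree layout, not a reference implementation:

```python
import numpy as np

class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build(points, depth=0):
    """Recursively bisect the points along alternating axes (median split)."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  build(points[:mid], depth + 1),
                  build(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Descend into the half-space containing the query first, then check
    whether the other half-space could still hold a closer point."""
    if node is None:
        return best
    d = np.linalg.norm(query - node.point)
    if best is None or d < best[1]:
        best = (node.point, d)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # Only search the far side if the splitting plane is closer than the best so far.
    if abs(diff) < best[1]:
        best = nearest(far, query, best)
    return best

points = np.random.rand(500, 3)
tree = build(points)
print(nearest(tree, np.random.rand(3)))
```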
Space partitioning requires building a spatial tree; the construction process is relatively complex and involves a large amount of computation.
Locality sensitive hashing (a hashing scheme; the hash table can be queried in roughly O(1) time)
Locality sensitive hashing (LSH) is a technique for grouping points in space into 'buckets' based on some distance metric operating on the points. Points that are close to each other under the chosen metric are mapped to the same bucket with high probability.[5]
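A minimal sketch of the bucketing idea, using random-hyperplane (sign) hashes for cosine-like similarity; the number of hyperplanes, the data and the dimensionality are illustrative assumptions:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes = 32, 8
planes = rng.normal(size=(n_planes, dim))   # random hyperplanes define the hash

def lsh_key(v):
    """Each point is hashed to an 8-bit key: the sign pattern of its projections."""
    return tuple((planes @ v > 0).astype(int))

# Build the buckets: nearby points tend to share a key with high probability.
data = rng.normal(size=(1000, dim))
buckets = defaultdict(list)
for i, v in enumerate(data):
    buckets[lsh_key(v)].append(i)

# Query: only points in the query's bucket are candidate neighbors.
query = data[0] + 0.01 * rng.normal(size=dim)
candidates = buckets[lsh_key(query)]
print(len(candidates), "candidates instead of", len(data))
```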
Nearest neighbor search in spaces with small intrinsic dimension
The cover tree has a theoretical bound that is based on the dataset's doubling constant. The bound on search time is O(c^12 log n), where c is the expansion constant of the dataset.
Vector approximation files
In high dimensional spaces, tree indexing structures become useless because an increasing percentage of the nodes need to be examined anyway. To speed up linear search, a compressed version of the feature vectors stored in RAM is used to prefilter the datasets in a first run. The final candidates are determined in a second stage using the uncompressed data from the disk for distance calculation.[6]
Compression/clustering based search
The VA-file approach is a special case of a compression based search, where each feature component is compressed uniformly and independently. The optimal compression technique in multidimensional spaces is Vector Quantization (VQ), implemented through clustering. The database is clustered and the most "promising" clusters are retrieved. Huge gains over VA-File, tree-based indexes and sequential scan have been observed.[7][8] Also note the parallels between clustering and LSH.
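A rough sketch of the clustering-based idea: cluster the database offline, then scan only the most promising clusters at query time. The use of scikit-learn's KMeans and all parameter choices below are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 64))

# Offline: cluster the database (a simple form of vector quantization).
km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(data)
labels = km.labels_

def search(query, n_probe=3):
    """Scan only the n_probe clusters whose centroids are closest to the query."""
    centroid_dist = np.linalg.norm(km.cluster_centers_ - query, axis=1)
    probe = np.argsort(centroid_dist)[:n_probe]
    candidates = np.where(np.isin(labels, probe))[0]
    best = candidates[np.argmin(np.linalg.norm(data[candidates] - query, axis=1))]
    return best

print(search(rng.normal(size=64)))
```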
3. Approximate nearest neighbors
Algorithms that support approximate nearest neighbor search include locality-sensitive hashing, best bin first and balanced box-decomposition tree based search.[9]
(1): ε-approximate nearest neighbor search is a special case of the nearest neighbor search problem. The solution to the ε-approximate nearest neighbor search is a point or multiple points within distance (1+ε)R from a query point, where R is the distance between the query point and its true nearest neighbor.
Reasons to approximate nearest neighbor search include the space and time costs of exact solutions in high-dimensional spaces (see curse of dimensionality) and that in some domains, finding an approximate nearest neighbor is an acceptable solution.
Approaches for solving ε-approximate nearest neighbor search include kd-trees, locality-sensitive hashing and brute force search.
(2): Best bin first is a search algorithm that is designed to efficiently find an approximate solution to the nearest neighbor search problem in very-high-dimensional spaces. The algorithm is based on a variant of the kd-tree search algorithm which makes indexing higher dimensional spaces possible. Best bin first is an approximate algorithm which returns the nearest neighbor for a large fraction of queries and a very close neighbor otherwise.[1]
Differences from kd tree:
- Backtracking is according to a priority queue based on closeness.
- Search a fixed number of nearest candidates and stop.
- A speedup of two orders of magnitude is typical. It is mainly used for similarity queries over very large databases; a sketch follows this list.
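A compressed sketch of the best-bin-first idea on a k-d tree: unexplored branches go into a priority queue keyed by their distance to the query, and the search stops after a fixed number of bins have been examined. The tree layout, names and the bin limit are illustrative assumptions, not the original authors' implementation:

```python
import heapq
import numpy as np

class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build(points, depth=0):
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return Node(points[mid], axis, build(points[:mid], depth + 1),
                build(points[mid + 1:], depth + 1))

def bbf_search(root, query, max_bins=25):
    """Best bin first: branches are revisited in order of their distance to the
    query (priority queue), and the search stops after max_bins descents."""
    best, best_dist, tie = None, float("inf"), 0
    heap = [(0.0, tie, root)]
    bins = 0
    while heap and bins < max_bins:
        _, _, node = heapq.heappop(heap)
        while node is not None:                 # descend toward the query's bin
            d = np.linalg.norm(query - node.point)
            if d < best_dist:
                best, best_dist = node.point, d
            diff = query[node.axis] - node.point[node.axis]
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            if far is not None:                 # queue the other side, keyed by
                tie += 1                        # its distance to the splitting plane
                heapq.heappush(heap, (abs(diff), tie, far))
            node = near
        bins += 1
    return best, best_dist

points = np.random.rand(2000, 10)
print(bbf_search(build(points), np.random.rand(10)))
```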
References: Beis, J.; Lowe, D. G. (1997). "Shape indexing using approximate nearest-neighbour search in high-dimensional spaces". Conference on Computer Vision and Pattern Recognition. Puerto Rico. pp. 1000–1006. CiteSeerX: 10.1.1.23.9493.
(3): LSH: http://en.wikipedia.org/wiki/Locality_sensitive_hashing. Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items). This is different from conventional hash functions, such as those used in cryptography, as in the LSH case the goal is to maximize the probability of "collision" of similar items rather than to avoid collisions.[1] Note how locality-sensitive hashing, in many ways, mirrors data clustering and nearest neighbor search.
(II): Locality-sensitive hashing:
(III): MinHash. Original article: http://my.oschina.net/pathenon/blog/65210
Adapted from Wikipedia: http://en.wikipedia.org/wiki/Locality_sensitive_hashing
Traditional hash algorithms are only responsible for mapping the original content, as uniformly and randomly as possible, to a signature value; in principle they behave like pseudo-random number generators. If two signatures produced by a traditional hash are equal, then with some probability the original contents are equal; if they differ, nothing more can be concluded than that the contents differ, because even a one-byte difference in the input can produce completely different signatures. In that sense, designing a hash algorithm whose signatures are also similar for similar content is a much harder task: besides telling us whether two inputs are equal, the signature must also convey how different two unequal inputs are.

MinHash [1] can be viewed as an instance of locality-sensitive hashing, a family of techniques that use hashing to map a large collection of data objects to smaller hash values in such a way that two objects that are close to each other are likely to receive the same hash value. In MinHash, the signature of a set can be regarded as its hash value. Other locality-sensitive hashing techniques exist for the Hamming distance between sets and the cosine distance between vectors, and locality-sensitive hashing also plays an important role in nearest neighbor search algorithms.[9]
1. Introduction:
Like SimHash, MinHash is also a form of LSH and can be used to quickly estimate the similarity of two sets. MinHash was proposed by Andrei Broder and was originally used to detect duplicate web pages in search engines. It can also be applied to large-scale clustering problems.
2. Jaccard index: Jaccard similarity and minimum hash values
Before introducing MinHash, let us first introduce the Jaccard index.
The Jaccard index is a measure of similarity, that is, a kind of distance metric. Given two sets A and B,
J(A, B) = |A ∩ B| / |A ∪ B|
That is, the Jaccard index of sets A and B equals the number of elements that A and B have in common divided by the total number of elements they have together. Clearly, the Jaccard index takes values in the interval [0, 1].
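In Python the Jaccard index is a one-liner over sets; the sample sets below are arbitrary:

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, a value in [0, 1]."""
    return len(a & b) / len(a | b)

print(jaccard({"apple", "pear", "plum"}, {"apple", "plum", "cherry"}))  # 0.5
```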
Suppose h is a hash function that maps the elements of A and B to distinct integers, and for a given set S define hmin(S) as the element x of S with the smallest value of h(x). Then hmin(A) = hmin(B) holds exactly when the element with the minimum hash value over the union A ∪ B lies in the intersection A ∩ B. Therefore, Pr[hmin(A) = hmin(B)] = J(A, B).
On the other hand, if r is a random variable that equals 1 when hmin(A) = hmin(B) and 0 otherwise, then r is an unbiased estimator of J(A, B). Its variance, however, is too high for it to be useful on its own. The idea of MinHash is to reduce the variance by averaging together many random variables constructed in the same way.
3. MinHash: multiple hash functions vs. a single hash function:
First, some notation: h(x) is a hash function that maps x to an integer; hmin(S) is the element of the set S whose hash value h(x) is the smallest.
Then, for sets A and B, hmin(A) = hmin(B) holds exactly when the element with the smallest hash value in A ∪ B also lies in A ∩ B. This assumes that h(x) is a good hash function with strong uniformity, one that maps different elements to different integers.
Hence Pr[hmin(A) = hmin(B)] = J(A, B); that is, the similarity of sets A and B equals the probability that their minimum hash values are equal.
With the result above, we can use MinHash to compute the similarity of two sets. There are generally two approaches. First approach: use multiple hash functions.
To estimate the probability that sets A and B share the same minimum hash value, we choose a certain number of hash functions, say K of them. We then hash sets A and B with each of the K functions, obtaining K minimum values for each set, for example Min(A)_k = {a1, a2, ..., ak} and Min(B)_k = {b1, b2, ..., bk}. The similarity of A and B is then |Min(A)_k ∩ Min(B)_k| / |Min(A)_k ∪ Min(B)_k|, i.e. the ratio of the number of elements Min(A)_k and Min(B)_k have in common to the total number of elements in the two.
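A small sketch of the multi-hash-function scheme just described: K independent hash functions, one minimum per function, and the overlap of the two sets of minima as the similarity estimate. The salted-hash construction below is an assumption for illustration; a common variant instead counts the signature positions where the two minima agree:

```python
import hashlib

K = 64

def h(x, seed):
    """One of K hash functions, simulated by salting a standard hash."""
    return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)

def signature(s):
    """For each of the K hash functions, keep the minimum hash value over the set."""
    return [min(h(x, seed) for x in s) for seed in range(K)]

def similarity(sig_a, sig_b):
    """The estimate described above: overlap of the two sets of minima."""
    a, b = set(sig_a), set(sig_b)
    return len(a & b) / len(a | b)

A = {"the", "quick", "brown", "fox"}
B = {"the", "quick", "brown", "dog"}
print(similarity(signature(A), signature(B)))  # rough estimate of J(A, B) = 0.6
```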
Second approach: use a single hash function.
The first approach has an obvious drawback: its computational cost is high. How does a single hash function solve this? Earlier we defined hmin(S) as the element of S with the smallest hash value; we can likewise define hmin_k(S) as the K elements of S with the smallest hash values. Then we only need to hash each set once and keep the K smallest values. To compute the similarity of two sets A and B, we take the number of elements in the intersection of the K smallest elements of A and the K smallest elements of B, divided by the number in their union.

By now you should have a rough idea of what MinHash does. But where exactly is the benefit? To compute the similarity of two documents, couldn't we simply count the common words and the total number of words and apply the Jaccard index directly? Indeed, for comparing just two documents, MinHash has no advantage and only complicates things. But when a huge number of documents must be compared, for example when computing item similarities in a recommender system, computing all pairwise similarities directly is far too expensive. Let us see how MinHash addresses this.
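A sketch of the single-hash-function (bottom-K) variant just described: hash every element once, keep the K smallest values per set, and compare those sketches. The hash choice and K are assumptions for illustration:

```python
import hashlib
import heapq

K = 64

def h(x):
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16)

def bottom_k(s):
    """Hash the set once and keep the K smallest hash values."""
    return set(heapq.nsmallest(K, (h(x) for x in s)))

def similarity(sk_a, sk_b):
    """Intersection over union of the two bottom-K sketches."""
    return len(sk_a & sk_b) / len(sk_a | sk_b)

A = set(range(0, 1000))
B = set(range(200, 1200))
print(similarity(bottom_k(A), bottom_k(B)))  # rough estimate of J(A, B) ≈ 0.67
```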
For example, take the universe of elements {a, b, c, d, e} with S1 = {a, d}, S2 = {c}, S3 = {b, d, e} and S4 = {a, c, d}. The characteristic matrix of these four sets is:

        S1  S2  S3  S4
    a    1   0   0   1
    b    0   0   1   0
    c    0   1   0   1
    d    1   0   1   1
    e    0   0   1   0

To compute a MinHash for one of the sets, pick a permutation of the rows; the MinHash value is the label of the first row (in that order) that contains a 1. For example, under the row permutation b, e, a, d, c the matrix becomes:

        S1  S2  S3  S4
    b    0   0   1   0
    e    0   0   1   0
    a    1   0   0   1
    d    1   0   1   1
    c    0   1   0   1

so h(S1) = a, h(S2) = c, h(S3) = b and h(S4) = a. Needless to say, a MinHash computed from a single row permutation gives an unreliable similarity estimate, so we choose several row permutations, compute the MinHash under each, and finally estimate the similarity with the Jaccard index formula. Generating the permutations themselves is expensive, however, especially for a very large matrix, so instead we can design random hash functions that simulate permutations by mapping the row indices 0..n randomly onto 0..n, e.g. H(0) = 100, H(1) = 3, and so on. Collisions are unavoidable, but they can be resolved with a secondary hash, and if the chosen hash function is sufficiently uniform and n is fairly large, the probability of a collision remains low. A sketch of this simulation follows below.
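A small sketch of simulating row permutations with random hash functions on the example above; the linear hash h(i) = (a*i + b) mod P, the number of simulated permutations, and the matching-fraction estimator are illustrative assumptions:

```python
import random

sets = {"S1": {"a", "d"}, "S2": {"c"}, "S3": {"b", "d", "e"}, "S4": {"a", "c", "d"}}
rows = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
P = 7  # a prime slightly larger than the number of rows

random.seed(0)
hashes = [(random.randrange(1, P), random.randrange(P)) for _ in range(100)]

def minhash(s, a, b):
    """Simulate a row permutation with h(i) = (a*i + b) mod P and take the minimum."""
    return min((a * rows[x] + b) % P for x in s)

def similarity(s, t):
    """Fraction of simulated permutations on which the two minima agree."""
    return sum(minhash(s, a, b) == minhash(t, a, b) for a, b in hashes) / len(hashes)

print(similarity(sets["S1"], sets["S4"]))  # estimates J(S1, S4) = 2/3
```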
So far we have only described the concrete procedure for computing similarities over a massive collection of documents with MinHash; how does it actually reduce the complexity? Suppose there are n documents, each of dimensionality m, and we pick k permutations for MinHash. Since each permutation maps a document to a single integer, the k permutations yield k integers per document. The resulting MinHash matrix is therefore n×k, whereas the original matrix is n×m; when m >> k, the amount of computation drops dramatically.
4. References:
(1): http://en.wikipedia.org/wiki/MinHash
(2): http://fuliang.iteye.com/blog/1025638
(3): Chum, Ondřej; Philbin, James; Isard, Michael; Zisserman, Andrew. "Scalable near identical image and shot detection". Proceedings of the 6th ACM International Conference on Image and Video Retrieval (CIVR'07), 2007. doi:10.1145/1282280.1282359. Chum, Ondřej; Philbin, James; Zisserman, Andrew. "Near duplicate image detection: min-hash and tf-idf weighting". Proceedings of the British Machine Vision Conference.