當(dāng)前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

局部敏感哈希算法

發(fā)布時間：2023/12/31 编程问答 24 豆豆

生活随笔收集整理的這篇文章主要介紹了局部敏感哈希算法小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

這篇文章介紹了局部敏感哈希算法，局部敏感哈希是非監(jiān)督的哈希算法。?
算法的輸入是實數(shù)域的特征向量，輸出為一個binary vector。?
利用哈希函數(shù)將數(shù)據(jù)點映射到不同的桶中是一種保形映射，使得數(shù)據(jù)點?

本文主要做關(guān)鍵部分的代碼解析。

入口函數(shù)lsh

T1=lsh('lsh',20,24,size(patches,1),patches,'range',255);

第一個參數(shù)是使用的算法的類型，包括兩種類型，分別是lsh和e2lsh?
生成一個range的參數(shù)，得到的[0 0 ,…0; 255 255 ,….,255]這樣的形式

range = processRange(d,range);

這個函數(shù)是用來產(chǎn)生lsh函數(shù)的。

Is = lshfunc(type,l,k,d,varargin{:});

l表示函數(shù)的個數(shù)，k表示一個函數(shù)中的位數(shù)，d表示數(shù)據(jù)的維度。

for j=1:l% select random dimensionsI(j).d = include(unidrnd(length(include),1,k)); % 均勻分布的，隨機選中k維% for each dimension select a threshold % hash key = [[ x(:,d)' >= t ]] t = unifrnd(0,1,1,k).*(range(2,I(j).d)-range(1,I(j).d)); %每一維都隨機選中一個閾值位于0~255之間 I(j).t = range(1,I(j).d)+t; I(j).k = k; end

這里hash函數(shù)就是一個簡單閾值函數(shù)，將原始的400維的數(shù)據(jù)，隨機選出k=24維，變?yōu)?到1，后文會有進一步說明。l為總共生成的哈希函數(shù)的數(shù)目，這里取值為20。?
產(chǎn)生Is的變量的內(nèi)容如下：

d是選擇的維度下標(biāo)，t是維度的閾值。

T = lshprep(type,Is,b);

T這個變量存儲了哈希查找哈希值以及索引信息。

T(j).type = type;T(j).Args = varargin;T(j).I = Is(j);T(j).B = B; T(j).count = 0; T(j).buckets = []; % prepare T's table T(j).Index = {}; T(j).verbose=1; % set up secondary hash table for buckets % max. index can be obtained by running lshhash on max. bucket T(j).bhash = cell(lshhash(ones(1,k)*255),1); % lshhash是一個計算hash值的函數(shù)，將24維的二值向量映射為一個哈希值

隨后的函數(shù)，將數(shù)據(jù)放入桶中，對T中變量進行賦值。

T = lshins(T,x,ind);

這個函數(shù)中有一些關(guān)鍵的處理，其中

buck = findbucket(T(j).type,x,T(j).I);%這是一個將數(shù)據(jù)轉(zhuǎn)化為二值向量的函數(shù)

它里面的主要采用了矩陣的比較，本質(zhì)上就是用剛才生成的閾值函數(shù)做了一個二值化。?
其中v是一個59500*24維的二值矩陣，每一行表示一個數(shù)據(jù)樣本。

v = x(I.d,:)' <= repmat(I.t,size(x,2),1);v = uint8(v+128);

但注意，輸出的d維二值向量每一維并不是[0， 1]，而在區(qū)間[128 129]，這可能是要用于后文二次哈希的計算方便。為了后文方便說明，我們用哈希向量來簡稱這個二值向量。

這里一個桶buck對應(yīng)著一個哈希向量，但是桶的數(shù)目非常多，直接來進行比較是很費時間的。

[uniqBuck,ib,bID] = unique(buck,'rows');keys = lshhash(uniqBuck);%返回每個桶的哈希key值

例如，對j=1這個哈希函數(shù)而言，總共有14615個不同的桶（新分配空間為14615*24），如果要查找一個桶就需要14615次比較非常費時。作者的優(yōu)化方案是進行二次哈希，讓多個哈希向量映射為一個整型的hash-key值，用lshhash函數(shù)完成此功能。

% allocate space for new buckets -- possibly excessiveT(j).buckets=[T(j).buckets; zeros(length(ib),T(j).I.k,'uint8')];

對每一個單獨的哈希key值ib(b)

% find which data go to bucket uniqBuck(b)thisBucket = find(bID==bID(ib(b)));% find out if this bucket already has anything% first, which bucket is it? 該hash函數(shù)T(j)下的，對應(yīng)于哈希key值keys(b)的桶是否已經(jīng)存在ihash = T(j).bhash{keys(b)}; % possible matching buckets if (isempty(ihash)) % nothing matches isb = []; else % may or may not match isb = ihash(find(all(bsxfun(@eq,uniqBuck(b,:),T(j).buckets(ihash,:)),2))); end

其中

isb = ihash(find(all(bsxfun(@eq,uniqBuck(b,:),T(j).buckets(ihash,:)),2)));

是一種非常有效的寫法，bsxfun(@eq ,a,b)這種形式會得到兩個向量之間的逐位比較，它matlab內(nèi)部的實現(xiàn)是通過循環(huán)來實現(xiàn)的。通過all在水平方向上進行判別，

就相當(dāng)于比較兩個向量是否相等。這一步是比較在T(j).bhash中存放的哈希向量中是否已經(jīng)存在當(dāng)前的獲得的哈希向量，即是否已經(jīng)記錄了當(dāng)前的桶，這樣我們就

可以分情況討論是往這個桶里添加新的數(shù)據(jù)，還是要先創(chuàng)建一個桶再添加新的數(shù)據(jù)。

if (~isempty(isb)) % 如果isb不為空，那么即該bucket已經(jīng)存在% adding to an existing bucket.oldcount=length(T(j).Index{isb}); % # elements in the bucket prior % to addition 添加前桶中元素的數(shù)目，主要是方便統(tǒng)計 newIndex = [T(j).Index{isb} ind(thisBucket)]; else % creating new bucket newBuckets=newBuckets+1; oldcount=0; isb = oldBuckets+newBuckets; T(j).buckets(isb,:)=uniqBuck(b,:);%為什么用128 129表示 T(j).bhash{keys(b)} = [T(j).bhash{keys(b)}; isb];%根據(jù)hash-key值來映射桶序號 newIndex = ind(thisBucket);%該桶中存放的元素的下標(biāo) end

隨后完成信息的更新

% if there is a bound on bucket capacity, and the bucket is full, % keep a random subset of B elements (note: we do this rather than % simply skip the new elements since that could introduce bias % towards older elements.) % There is still a bias since older elements have more chances to get % thrown out. if (length(newIndex) > T(j).B) rp=randperm(length(newIndex)); newIndex = newIndex(rp(1:T(j).B));% 如果超過的了桶的容量限制，那么隨機選定T(j).B個數(shù)據(jù) end % ready to put this into the table T(j).Index{isb}= newIndex;%重新為屬于該桶的數(shù)據(jù)下標(biāo)賦值 % update distinct element count T(j).count = T(j).count + length(newIndex)-oldcount; %新數(shù)目減去老數(shù)目為改變量，注意如果以前桶中有元素，是通過追加的方式添加上去的，在追加后再與T(j).B進行比較。作者這么做，就是為了保證桶中元素不會因為滿了而傾向于保持老元素，新元素就加不進去了，所以先追加后然后再隨機選擇指定數(shù)目保留下來。當(dāng)然這樣做還是會造成桶中舊的元素更容易被扔掉這一情形。

運行分析

運行l(wèi)sh函數(shù)會得到：

Table 5 adding 13852 buckets (now 13852) Table 5: 59500 elements 12619 distinct buckets Table 6 adding 12619 buckets (now 12619) Table 6: 59500 elements 11936 distinct buckets Table 7 adding 11936 buckets (now 11936) Table 7: 59500 elements 15997 distinct buckets

參數(shù)查看 lshstats

examine statistics of LSH data structure

[mi,ma,me]=lshstats(T,B,xref,xtst,minNN)

例如；

lshstats(T1(1:5),'test',patches,patches(:,1:1000),2);

輸出為?
Table 1: 59500 in 13404 bkts, med 1, max 4288, avg 813.19?
Table 2: 59500 in 12661 bkts, med 1, max 2646, avg 544.55?
Table 3: 59500 in 16147 bkts, med 1, max 4057, avg 751.01?
Table 4: 59500 in 11627 bkts, med 1, max 4989, avg 864.60?
Table 5: 59500 in 13630 bkts, med 1, max 3528, avg 601.55

這表示table1有13404 個桶，平均容量是每個桶1個數(shù)據(jù)，最大容量為4288，期望容量為813.19

Running test…10% 20% 30% 40% 50% 60% 70% 80% 90% 100%?
# of comparisons: mean 980.14, max 8122, failures: 54

這里使用了5個哈希函數(shù)，它的含義是對前1000個樣本進行查找，平均每次查找需要比較980個樣本，但是同時失敗次數(shù)為54次

如果增加哈希函數(shù)的數(shù)目，會得到不同的結(jié)果，根據(jù)參考文獻中的分析，如果增加哈希函數(shù)的數(shù)目，那么會需要更長的查找時間，但是同時recall將會增加，例如這里我們用全部的20個哈希函數(shù)來做實驗。

lshstats(T1,'test',patches,patches(:,1:1000),2);

得到結(jié)果?
Running test…10% 20% 30% 40% 50% 60% 70% 80% 90% 100%?
# of comparisons: mean 2957.24, max 13120, failures: 2?
可以發(fā)現(xiàn)平均查找所需的時間變長了，但是recall相應(yīng)的變高的(幾乎沒有錯誤)。

lshlookup

下面是查找第50個樣本，在這之前，首先增加二值向量的長度，即引用文獻中的b的長度，這會減少平均每個桶中的元素數(shù)目

lshstats(T2(1:10),'test',patches,patches(:,1:1000),2);

Table 1: 59500 in 33066 bkts, med 1, max 1829, avg 146.51?
Table 2: 59500 in 34018 bkts, med 1, max 1638, avg 160.95?
Table 3: 59500 in 34077 bkts, med 1, max 1386, avg 156.09?
Table 4: 59500 in 35716 bkts, med 1, max 2813, avg 210.50?
Table 5: 59500 in 34492 bkts, med 1, max 1470, avg 194.75?
Table 6: 59500 in 34659 bkts, med 1, max 1543, avg 156.86?
Table 7: 59500 in 33033 bkts, med 1, max 1232, avg 146.30?
Table 8: 59500 in 33923 bkts, med 1, max 1955, avg 152.32?
Table 9: 59500 in 34032 bkts, med 1, max 1718, avg 176.25?
Table 10: 59500 in 32402 bkts, med 1, max 2862, avg 226.41

注意avg變小了

tic; [nnlsh,numcand]=lshlookup(patches(:,50),patches,T2,'k',11,'distfun','lpnorm','distargs',{1});toc

算法運行結(jié)果結(jié)果實現(xiàn)檢索一個數(shù)據(jù)所需的時間：

時間已過 0.030697 秒。

下面來解析這個函數(shù)的實現(xiàn)?
需要完成的任務(wù)是找到所有match這個query的tables。?
步驟1 用哈希函數(shù)T(j)獲取查詢x0的映射的50維(維度為哈希函數(shù)中隨機選定的位數(shù)的長度，即b)二值向量，由于加了128，所以范圍是在[128,129]。

buck = findbucket(T(j).type,x0,T(j).I);

步驟2 將該向量轉(zhuǎn)化成哈希key，這一步不是一一映射，而是多對一的映射，主要目的是為了提升向量的檢索速度。

key = lshhash(buck);

步驟3 根據(jù)哈希key值獲取所有的哈希向量，一個哈希key值對應(yīng)著多個bucket

ihash = T(j).bhash{key}; % possible matching buckets

步驟4 進一步查找到該哈希向量，即找到對應(yīng)的桶

if (~isempty(ihash)) % nothing matchesb = ihash(find(all(bsxfun(@eq,buck,T(j).buckets(ihash,:)),2))); if (~isempty(b)) iNN = [iNN T(j).Index{b}]; %把該桶中的數(shù)據(jù)union起來，因為不同的哈希函數(shù)會有不同的結(jié)果 end end

步驟5?
去除重復(fù)數(shù)據(jù)

[iNN,iu]=unique(iNN); cand = length(iNN);

步驟6?
這一步主要是將相似列表中的數(shù)據(jù)做個排序返回。用于CBIR檢索很合適。

if (~isempty(iNN))if (strcmp(sel,'best'))D=feval(distfun,x0,Xsel(x,iNN),distargs{:});% 即比較這些桶中的最近鄰數(shù)據(jù)和query的距離[dist,sortind]=sort(D);ind = find(dist(1:min(k,length(dist)))<=r);%返回小于指定距離的下標(biāo)，基于iNN iNN=iNN(sortind(ind));% 返回相似數(shù)據(jù)，這就完成了檢索 else % random rp=randperm(cand); choose=[]; for i=1:length(rp) d = feval(distfun,x0,Xsel(x,iNN(rp(i))),distargs{:}); if (d <= r) choose = [choose iNN(rp(i))]; if (length(choose) == k) break; end end end iNN = choose; end end

轉(zhuǎn)載于:https://www.cnblogs.com/wt869054461/p/5754888.html

總結(jié)

以上是生活随笔為你收集整理的局部敏感哈希算法的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇：电磁仿真原理——4. 矩量法（Momen
下一篇：神器！人工智能分离歌曲中的人声和背景音乐