當(dāng)前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

Node2vec原理剖析，代码实现

發(fā)布時(shí)間：2023/12/31 编程问答 36 豆豆

生活随笔收集整理的這篇文章主要介紹了 Node2vec原理剖析，代码实现小編覺得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

DeepWalk原理介紹

與詞嵌入類似，圖嵌入基本理念是基于相鄰頂點(diǎn)的關(guān)系，將目的頂點(diǎn)映射為稠密向量，以數(shù)值化的方式表達(dá)圖中的信息，以便在下游任務(wù)中運(yùn)用。

Word2Vec根據(jù)詞與詞的共現(xiàn)關(guān)系學(xué)習(xí)向量的表示，DeepWalk受其啟發(fā)。它通過隨機(jī)游走的方式提取頂點(diǎn)序列，再用Word2Vec模型根據(jù)頂點(diǎn)和頂點(diǎn)的共現(xiàn)關(guān)系，學(xué)習(xí)頂點(diǎn)的向量表示?？梢岳斫鉃橛梦淖职褕D的內(nèi)容表達(dá)出來，如下圖所示。

DeepWalk訓(xùn)練圖表示的整個(gè)過程大致可以分為2步：

隨機(jī)游走提取頂點(diǎn)序列
使用skip-gram學(xué)習(xí)頂點(diǎn)嵌入

訓(xùn)練時(shí)采用層次Softmax（Hierarchical Softmax）優(yōu)化算法，避免計(jì)算所有詞的softmax。

Node2vec原理

DeepWalk不適用于有權(quán)圖，它無法學(xué)習(xí)邊上的權(quán)重信息。Node2Vec可以看作DeepWalk的擴(kuò)展，它學(xué)習(xí)嵌入的過程也可以分兩步：

二階隨機(jī)游走（2ndorderrandomwalk）
使用skip-gram學(xué)習(xí)頂點(diǎn)嵌入

可以看到與DeepWalk的區(qū)別就在于游走的方式，在二階隨機(jī)游走中，轉(zhuǎn)移概率 $π_{vx}$ 受權(quán)值 $w_{vx}$ 影響（無權(quán)圖中 $w_{vx}$ 為1）：
$πvx=αpq(t,x)?wvx\pi_{vx}=\alpha_{pq}(t,x) \cdot w_{vx}$

$αpq(t,x)={1p,ifdtx=01,ifdtx=11q,ifdtx=2\alpha_{pq}(t,x)=\left\{ \begin{aligned} \frac{1}{p}, \quad & if\quad{d_{tx}=0} \\ 1, \quad & if\quad{d_{tx}=1} \\ \frac{1}{q}, \quad & if\quad{d_{tx}=2} \end{aligned} \right.$

t 代表上一個(gè)節(jié)點(diǎn)，v 代表當(dāng)前節(jié)點(diǎn)，x 代表下一個(gè)準(zhǔn)備訪問的節(jié)點(diǎn)。 $d_{tx}$ 代表上一個(gè)節(jié)點(diǎn)與待訪問節(jié)點(diǎn)的距離。 $d_{tx} = 0$ 代表從當(dāng)前節(jié)點(diǎn)返回上一個(gè)訪問節(jié)點(diǎn)，即“t->v->t”。

算法通過p、q兩個(gè)超參數(shù)來控制游走到不同頂點(diǎn)的概率。

q：被稱作進(jìn)出參數(shù)，控制“向內(nèi)”還是“向外”游走。若q>1，傾向于訪問與 t 接近的頂點(diǎn)，若 q<1 則傾向于訪問遠(yuǎn)離 t 的頂點(diǎn)。
p：被稱為返回參數(shù)，控制重復(fù)訪問剛剛訪問過的頂點(diǎn)的概率。若設(shè)置的值較大，就不大會(huì)剛問剛剛訪問過的頂點(diǎn)。若設(shè)置的值較小，那就可能回路返回一步。

p,q的大小可以控制算法是偏向于DFS還是BFS。p越小，隨機(jī)游走回節(jié)點(diǎn)t的可能性越大，算法傾向于表達(dá)結(jié)構(gòu)性；q越小，隨機(jī)游走到遠(yuǎn)方的可能性越大，網(wǎng)絡(luò)傾向于表達(dá)同質(zhì)性。

也即，BFS偏向于表達(dá)結(jié)構(gòu)性，DFS傾向于表達(dá)同質(zhì)性。

這里解釋一下，網(wǎng)絡(luò)的“同質(zhì)性”指的是距離相近節(jié)點(diǎn)的embedding應(yīng)該盡量近似，如圖，節(jié)點(diǎn)u與其相連的節(jié)點(diǎn)s1、s2、s3、s4的embedding表達(dá)應(yīng)該是接近的，這就是“同質(zhì)性“的體現(xiàn)。

“結(jié)構(gòu)性”指的是結(jié)構(gòu)上相似的節(jié)點(diǎn)的embedding應(yīng)該盡量接近，圖中節(jié)點(diǎn)u和節(jié)點(diǎn)s6都是各自局域網(wǎng)絡(luò)的中心節(jié)點(diǎn)，結(jié)構(gòu)上相似，其embedding的表達(dá)也應(yīng)該近似，這是“結(jié)構(gòu)性”的體現(xiàn)。

總的來說，node2vec是一種綜合考慮DFS鄰域和BFS鄰域的graph embedding方法。簡(jiǎn)單來說，可以看作是deepwalk的一種擴(kuò)展，是結(jié)合了DFS和BFS隨機(jī)游走的deepwalk。

Node2vec實(shí)現(xiàn)代碼詳解

Node2vec的算法分為如下幾步：

生成整圖的轉(zhuǎn)移概率矩陣W，生成過程中要考慮(preNode, curNode, [curNode所有鄰居dstNode的跳轉(zhuǎn)概率] )三節(jié)跳躍

基于W生成節(jié)點(diǎn)游走序列。會(huì)采用Alias Sample采樣。

對(duì)序列采用word2vec訓(xùn)練，得到node embedding?？赡懿捎秘?fù)采樣訓(xùn)練方法。

偽代碼如下：

維度d，每個(gè)節(jié)點(diǎn)生成r個(gè)長(zhǎng)度為l的語料序列。上下文context長(zhǎng)度為k。
start node : u
初始節(jié)點(diǎn)u和它的鄰域表示：{u,s4,s5,s6,s8,s9}

u的鄰域： $N_s(u)=(s_4,s_5,s_6,s_8,s_9)$

walk是節(jié)點(diǎn)u生成的隨機(jī)游走的樣本結(jié)果集。為了便于說明，上下兩個(gè)偽代碼框分別從0開始編號(hào)。

第5行到第8行對(duì)每個(gè)頂點(diǎn)進(jìn)行r輪游走，生成長(zhǎng)度為l的頂點(diǎn)序列，保存在集合walks中。

第9行是做梯度下降優(yōu)化。推薦使用負(fù)采樣

node2vecWalk部分的第1步是初始化序列walk，只需注意此時(shí)walk已經(jīng)包含了兩個(gè)元素[pre, curr]，然后進(jìn)行l(wèi)步n2v游走。第3行獲取walk中上一步的節(jié)點(diǎn)作為當(dāng)前節(jié)點(diǎn)curr，…剩下就沒有難度了。

注意，在GetNeighbors中，對(duì)于當(dāng)前節(jié)點(diǎn)curr，可以對(duì)所有鄰居采樣，再計(jì)算。

代碼寫得很清楚，只需要注意構(gòu)建圖的時(shí)候，對(duì)每個(gè)節(jié)點(diǎn)先走一步，產(chǎn)生(prevNodeId ,currentNodeId) 這個(gè)itempair，在調(diào)用node2vecWalk，對(duì)currentNodeId走l輪，產(chǎn)生(item1,item2,item3…)

具體源碼如下：

這部分為Alias Sample采樣算法

import numpy as np def alias_setup(probs):'''Compute utility lists for non-uniform sampling from discrete distributions.Refer to https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/for details'''K = len(probs)q = np.zeros(K) #保存樣本概率J = np.zeros(K, dtype=np.int) #保存補(bǔ)1的事件smaller = []larger = []for kk, prob in enumerate(probs):q[kk] = K*probif q[kk] < 1.0:smaller.append(kk)else:larger.append(kk)while len(smaller) > 0 and len(larger) > 0:small = smaller.pop()large = larger.pop()J[small] = largeq[large] = q[large] + q[small] - 1.0 #q[large]-(1-q[small])if q[large] < 1.0:smaller.append(large)else:larger.append(large)return J, q #（alias，prab）def alias_draw(J, q):'''Draw sample from a non-uniform discrete distribution using alias sampling.'''K = len(J)kk = int(np.floor(np.random.rand()*K))if np.random.rand() < q[kk]:return kkelse:return J[kk]

這部分為真正的源碼。

import numpy as np import networkx as nx import random from gensim.models import word2vecclass Graph():def __init__(self, nx_G, is_directed, p, q):self.G = nx_Gself.is_directed = is_directedself.p = pself.q = qdef node2vec_walk(self, walk_length, start_node):'''Simulate a random walk starting from start node.'''G = self.Galias_nodes = self.alias_nodesalias_edges = self.alias_edgeswalk = [start_node]while len(walk) < walk_length:cur = walk[-1]cur_nbrs = sorted(G.neighbors(cur))if len(cur_nbrs) > 0:# 如果序列中僅有一個(gè)結(jié)點(diǎn)，即第一次游走# alias_nodes中保存了alias_setup的[alias, accept]，通過alias_draw返回采樣的下一個(gè)索引號(hào)if len(walk) == 1:walk.append(cur_nbrs[alias_draw(alias_nodes[cur][0], alias_nodes[cur][1])])else:# 當(dāng)前游走結(jié)點(diǎn)的前一個(gè)結(jié)點(diǎn)和下一個(gè)節(jié)點(diǎn)prev = walk[-2]# 使用alias_edges中記錄的[alias, accept]，來采樣鄰居中的下一個(gè)節(jié)點(diǎn)next = cur_nbrs[alias_draw(alias_edges[(prev, cur)][0], alias_edges[(prev, cur)][1])]walk.append(next)else:breakreturn walkdef simulate_walks(self, num_walks, walk_length):'''Repeatedly simulate random walks from each node.'''G = self.Gwalks = []nodes = list(G.nodes())# nodes采樣一次為一個(gè)epoch，此處就是num_walks個(gè)epochprint('Walk iteration:')for walk_iter in range(num_walks):print(str(walk_iter+1), '/', str(num_walks))random.shuffle(nodes)for node in nodes:walks.append(self.node2vec_walk(walk_length=walk_length, start_node=node))return walksdef get_alias_edge(self, src, dst):'''Get the alias edge setup lists for a given edge.:return alias_setup(): 在上一次訪問頂點(diǎn) t ，當(dāng)前訪問頂點(diǎn)為 v 時(shí)到下一個(gè)頂點(diǎn) x 的未歸一化轉(zhuǎn)移概率。:param src: 隨機(jī)游走序列種的上一個(gè)結(jié)點(diǎn):param dst: 當(dāng)前結(jié)點(diǎn)參數(shù)p控制重復(fù)訪問剛剛訪問過的頂點(diǎn)的概率。若p較大，則訪問剛剛訪問過的頂點(diǎn)的概率會(huì)變低。參數(shù)q控制著游走是向外還是向內(nèi)：若q>1，隨機(jī)游走傾向于訪問和上一次的t接近的頂點(diǎn)(偏向BFS)；若q<1，傾向于訪問遠(yuǎn)離t的頂點(diǎn)(偏向DFS)'''G = self.Gp = self.pq = self.qunnormalized_probs = []for dst_nbr in sorted(G.neighbors(dst)):if dst_nbr == src: # 如果是要返回上一個(gè)節(jié)點(diǎn)unnormalized_probs.append(G[dst][dst_nbr]['weight']/p)elif G.has_edge(dst_nbr, src): # 如果接下來訪問的節(jié)點(diǎn)與src的距離與當(dāng)前節(jié)點(diǎn)相等unnormalized_probs.append(G[dst][dst_nbr]['weight'])else:unnormalized_probs.append(G[dst][dst_nbr]['weight']/q)norm_const = sum(unnormalized_probs)normalized_probs = [float(u_prob)/norm_const for u_prob in unnormalized_probs]return alias_setup(normalized_probs)def preprocess_transition_probs(self):'''Preprocessing of transition probabilities for guiding the random walks.用于引導(dǎo)隨機(jī)游走的預(yù)處理，得到馬爾可夫轉(zhuǎn)移概率矩陣。'''G = self.Gis_directed = self.is_directedalias_nodes = {}# G.neighbors(node) 與頂點(diǎn)相鄰的所有頂點(diǎn)，更方便更快的訪問adjacency字典用: G[cur]for node in G.nodes():# 根據(jù)鄰居節(jié)點(diǎn)的權(quán)重，計(jì)算轉(zhuǎn)移概率unnormalized_probs = [G[node][nbr]['weight'] for nbr in sorted(G.neighbors(node))]norm_const = sum(unnormalized_probs)# 計(jì)算當(dāng)前節(jié)點(diǎn)到鄰居節(jié)點(diǎn)的轉(zhuǎn)移概率，其實(shí)就是權(quán)重歸一化normalized_probs = [float(u_prob)/norm_const for u_prob in unnormalized_probs]# 設(shè)置alias table，保存每個(gè)節(jié)點(diǎn)的accept[i]和alias[i]，為后面alias采樣做準(zhǔn)備。alias_nodes[node] = alias_setup(normalized_probs)alias_edges = {}triads = {}# 保存每條邊的accept[i]和alias[i]if is_directed:for edge in G.edges():alias_edges[edge] = self.get_alias_edge(edge[0], edge[1])else:for edge in G.edges():alias_edges[edge] = self.get_alias_edge(edge[0], edge[1]) # 隨機(jī)游走序列種的上一個(gè)結(jié)點(diǎn) 當(dāng)前節(jié)點(diǎn)alias_edges[(edge[1], edge[0])] = self.get_alias_edge(edge[1], edge[0])self.alias_nodes = alias_nodesself.alias_edges = alias_edgesprint(self.alias_nodes)print(self.alias_edges)returndef alias_setup(probs):'''Compute utility lists for non-uniform sampling from discrete distributions.Refer to https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/for details:param probs: 指定的采樣結(jié)果概率分布列表。期望按這個(gè)概率列表來采樣每個(gè)隨機(jī)變量X。:return J: alias[i]表示第i列中不是事件i的另一個(gè)事件的編號(hào)。:return p: accept[i]表示事件i占第i列矩形的面積的比例。'''K = len(probs)# q表示：accept數(shù)組q = np.zeros(K)# J表示：alias數(shù)組J = np.zeros(K, dtype=np.int)# Alias方法將整個(gè)概率分布?jí)撼梢粋€(gè) 1*N 的矩形，每個(gè)事件轉(zhuǎn)換為矩形中的面積。# 將面積大于1的事件多出的面積補(bǔ)充到面積小于1對(duì)應(yīng)的事件中，以確保每一個(gè)小方格的面積為1，# 同時(shí)，保證每一方格至多存儲(chǔ)兩個(gè)事件。smaller = [] # 面積小于1的事件larger = [] # 面積大于1的事件for kk, prob in enumerate(probs):q[kk] = K*probif q[kk] < 1.0:smaller.append(kk)else:larger.append(kk)while len(smaller) > 0 and len(larger) > 0:small = smaller.pop()large = larger.pop()J[small] = large# 其實(shí)是 q[large] - (1.0 - q[small])，把大的削去(1.0 - q[small])填充到小的上q[large] = q[large] + q[small] - 1.0# 大的剩下的面積，放到下一輪繼續(xù)倒騰if q[large] < 1.0:smaller.append(large)else:larger.append(large)return J, qdef alias_draw(J, q):'''Draw sample from a non-uniform discrete distribution using alias sampling.參考：https://zhuanlan.zhihu.com/p/54867139:param q: accept數(shù)組，表示事件i占第i列矩形的面積的比例；:param J: alias數(shù)組，表示alias矩形的第i列中不是事件i的另一個(gè)事件的編號(hào)，也就是填充的那一列的序號(hào)；生成一個(gè)隨機(jī)數(shù) kk in [0, K]，另一個(gè)隨機(jī)數(shù) x in [0,1],如果 x < accept[kk]，表示接受事件kk，返回kk，否則拒絕事件kk，返回alias[kk]'''K = len(J)kk = int(np.floor(np.random.rand()*K))if np.random.rand() < q[kk]:return kkelse:return J[kk]def read_graph(input_file, directed):'''Reads the input network in networkx.'''if directed:G = nx.read_edgelist(input_file, delimiter=",", nodetype=int, data=(('weight',float),), create_using=nx.DiGraph())else:G = nx.read_edgelist(input_file, delimiter=",", nodetype=int, create_using=nx.DiGraph())for edge in G.edges():G[edge[0]][edge[1]]['weight'] = 1if not directed:G = G.to_undirected()return Gdef learn_embeddings(walks):'''Learn embeddings by optimizing the Skipgram objective using SGD.'''walks = [list(map(str, walk)) for walk in walks]print(walks) # model = word2vec.Word2Vec(walks, vector_size=64, window=3, min_count=0, sg=1, workers=1, epochs=5)# model.save_word2vec_format(args.output)#model.wv.save_word2vec_format(args.output, binary=False)returndef main(directed):'''Pipeline for representational learning for all nodes in a graph.'''nx_G = read_graph(r"C:\Users\Administrator\TensorFlow\game.csv", directed)print(list(nx_G.edges(data=True)), list(nx_G))for node in nx_G.neighbors(2):print(node)G = Graph(nx_G, False, 1, 2)G.preprocess_transition_probs()walks = G.simulate_walks(5, 3)learn_embeddings(walks)if __name__ == "__main__":main(directed = False)

Alias Sample Method 別名采樣算法

更詳細(xì)的可以觀看如下blog：
Darts, Dice, and Coins: Sampling from a Discrete Distribution

在隨機(jī)游走的過程中，假設(shè)已經(jīng)游走到當(dāng)前節(jié)點(diǎn)，則下一步的游走要參考下面的公式。
$πvx=αpq(t,x)?wvx\pi_{vx}=\alpha_{pq}(t,x) \cdot w_{vx}$

則對(duì)于不同的節(jié)點(diǎn)，被采樣到的概率是不一樣的，因此需要采用采樣算法。Alias Sample算法是一種時(shí)間復(fù)雜度為O(1)的算法，因此特別適合大規(guī)模的網(wǎng)絡(luò)游走情況。

構(gòu)建Alias Table

每個(gè)概率乘以四（事件數(shù)）。

然后拼湊使每列值為 1，并保證每列最多只包含兩個(gè)事件。

至此整個(gè)方法大功告成。

Alias Method具體算法如下：

按照上面說的方法，將整個(gè)概率分布拉平成為一個(gè)1*N的長(zhǎng)方形即為Alias Table，構(gòu)建上面那張圖之后，儲(chǔ)存兩個(gè)數(shù)組，一個(gè)里面存著第i列對(duì)應(yīng)的事件ii矩形站的面積百分比【也即其概率】，上圖的話數(shù)組就為

P r a b [2 / 3, 1 / 1, 1 / 3, 1 / 3]

，另一個(gè)數(shù)組里面儲(chǔ)存著第i列不是事件i的另外一個(gè)事件的標(biāo)號(hào)，像上圖就是Alias[2 NULL 1 1]
2.產(chǎn)生兩個(gè)隨機(jī)數(shù)，第一個(gè)產(chǎn)生1~N 之間的整數(shù)i，決定落在哪一列。扔第二次骰子，0~1之間的任意數(shù)，判斷其與Prab[i]大小，如果小于Prab[i]，則采樣i，如果大于Prab[i]，則采樣Alias[i]

讀者可以采樣自己運(yùn)行一下Alias算法，體會(huì)一下，代碼在上面。

總結(jié)

以上是生活随笔為你收集整理的Node2vec原理剖析，代码实现的全部?jī)?nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

上一篇： Linux C 下的socket网络编程
下一篇： iptv写代理php,苏州电信iptv用