DSSM Learning: Introduction and Experiments
DSSM Study Report
Starting from a Text-Mining Knowledge Map
When you only know parts of a field, the best place to start is a knowledge map.
Looking up the concept: DSSM takes the massive click-through logs of Query/Title pairs from a search engine, uses a DNN to represent the Query and the Title as low-dimensional semantic vectors, measures the distance between the two vectors with cosine similarity, and trains a semantic similarity model from that signal. The trained model can both predict the semantic similarity of two sentences and produce a low-dimensional semantic vector for a single sentence.
The main problems it addresses:
- It removes the biggest problem of LSA, LDA, Autoencoders and similar methods: dictionary explosion (and the resulting computational cost). The number of English words is effectively unbounded, but the number of letter n-grams is limited.
- Word-based features handle new words poorly; letter n-grams represent them effectively and are more robust (see the sketch after this list).
- It uses a supervised objective to optimize the semantic embedding mapping.
- It removes the need for manual feature engineering.
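To make the letter n-gram idea concrete, here is a minimal sketch of letter-trigram word hashing (my own illustration, not code from the paper or the repository): each word is wrapped in boundary markers and split into overlapping trigrams, so a text becomes a sparse bag of trigram counts rather than a one-hot word vector.

from collections import Counter

def letter_trigrams(word):
    # Wrap the word in '#' boundary markers and take every 3-letter window,
    # e.g. 'good' -> ['#go', 'goo', 'ood', 'od#'].
    marked = '#' + word.lower() + '#'
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def word_hash(text):
    # Represent a text as a sparse bag of letter trigrams (trigram -> count).
    counts = Counter()
    for word in text.split():
        counts.update(letter_trigrams(word))
    return counts

print(letter_trigrams('good'))   # ['#go', 'goo', 'ood', 'od#']
print(word_hash('good food'))    # 'ood' and 'od#' occur twice, the rest once

Because the trigram vocabulary is small and closed, even unseen words decompose into familiar trigrams, which is where the robustness to new words comes from.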
Judging from these keywords, DSSM sits in the following areas of the map:
- "Represent as low-dimensional vectors" corresponds to the latent semantic analysis part of the word embedding region.
- "Compute the distance between two semantic vectors with cosine" corresponds to string distance in the document region.
- Searching for the document-level keywords of text mining turned up almost nothing related, so these are presumably terms specific to the word-vector direction, describing properties of word vectors.
Code and Related Work
Key papers and related blog posts
DSSM & Multi-view DSSM TensorFlow implementation:
https://github.com/InsaneLife/dssm
https://blog.csdn.net/shine19930820/article/details/79042567
Learning Deep Structured Semantic Models for Web Search using Clickthrough Data ※
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf
A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval (CNN-DSSM)
http://www.iro.umontreal.ca/~lisa/pointeurs/ir0895-he-2.pdf
SEMANTIC MODELLING WITH LONG-SHORT-TERM MEMORY FOR INFORMATION RETRIEVAL
https://arxiv.org/pdf/1412.6629.pdf
A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf
Microsoft's DSSM research project page:
https://www.microsoft.com/en-us/research/project/dssm/#!publications
We use the second item above as the code for our experimental study.
Background: the problem DSSM solves
- A typical search engine mostly returns content whose words overlap with the query.
- But when people search, they often cannot describe exactly what they want (or they use different words for the same thing).
- In those cases the retrieved results are often inaccurate.
- A latent semantic model is then needed to match the query against the results the user actually wants.
- Earlier topic models were mostly unsupervised and context-dependent, so they were often inaccurate.
- Now that we (i.e. the paper's authors) have click-through data, a query-document matching model can be trained directly.
- (See the INTRODUCTION section on the first page of the paper above.)
Data
桂魚 | {“桂魚多少錢一斤”: “0.071”, “桂魚價格”: “0.013”, “桂魚圖片大全”: “0.012”, “桂魚湯”: “0.010”, “桂魚的營養價值與功效”: “0.011”, “桂魚圖片”: “0.178”, “桂魚怎么釣”: “0.024”, “桂魚的做法”: “0.054”, “桂魚多少錢一斤2018”: “0.012”, “桂魚怎么做好吃”: “0.116”} | 石桂魚 | 百科 | 0
- Fields are separated by "|".
- The first field is the prefix: what the user typed initially.
- The second field is the prediction: up to 10 complete query terms predicted from the current prefix (the prefix itself may appear among them), each with its empirical probability.
- The third field is the article title (presumably the title of the article that was finally clicked).
- The fourth field is the article's content tag.
- The fifth field indicates whether it was clicked.
- To train the DSSM demo, prefix-title pairs are used as positive samples, and the prefix paired with each query_prediction entry (other than the title) as negative samples (a parsing sketch follows this list).
- The English click-through data has not been released, but the format is presumably similar.
- Word hashing differs between Chinese and English: English uses letter n-grams over words, but taking whole Chinese words as units would make the vector space far too large, so Chinese uses single characters as units, i.e. character n-grams.
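As a rough illustration of how such a record could be turned into training pairs (my own sketch with hypothetical helper names, not the repository's preprocessing code; it assumes the prediction field is plain JSON and contains no "|" characters):

import json

def parse_record(line):
    # Split one log line into its five '|'-separated fields.
    prefix, prediction, title, tag, label = [field.strip() for field in line.split('|')]
    return prefix, json.loads(prediction), title, tag, int(label)

def make_pairs(prefix, prediction, title):
    # (prefix, title) is the positive sample; (prefix, predicted query) pairs
    # for every predicted query other than the title are negative samples.
    pairs = [(prefix, title, 1)]
    for query in prediction:
        if query != title:
            pairs.append((prefix, query, 0))
    return pairs

line = '桂魚 | {"桂魚圖片": "0.178", "桂魚的做法": "0.054"} | 石桂魚 | 百科 | 0'
prefix, prediction, title, tag, label = parse_record(line)
print(make_pairs(prefix, prediction, title))
# [('桂魚', '石桂魚', 1), ('桂魚', '桂魚圖片', 0), ('桂魚', '桂魚的做法', 0)]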
Code Study
- The code studied comes from this repository: https://github.com/LiangHao151941/dssm
- The file is dssm_v3.py.
- The key is to read the code against the DSSM architecture figure in the paper: the layers, the weights, and the inputs and outputs in the code correspond to the figure almost one-to-one.
- The code, with comments, follows.
import pickle
import random
import time
import sys

import numpy as np
import tensorflow as tf

# Command-line flags: summary directory, learning rate and training schedule.
flags = tf.app.flags
FLAGS = flags.FLAGS

flags.DEFINE_string('summaries_dir', '/tmp/dssm-400-120-relu', 'Summaries directory')
flags.DEFINE_float('learning_rate', 0.1, 'Initial learning rate.')
flags.DEFINE_integer('max_steps', 900000, 'Number of steps to run trainer.')
flags.DEFINE_integer('epoch_steps', 18000, "Number of steps in one epoch.")
flags.DEFINE_integer('pack_size', 2000, "Number of batches in one pickle pack.")
flags.DEFINE_bool('gpu', 1, "Enable GPU or not")

# Load the pre-hashed query/doc matrices (pickled scipy sparse matrices) into memory.
start = time.time()

doc_train_data = None
query_train_data = None

query_test_data = pickle.load(open('../data/query.test.pickle', 'rb')).tocsr()
doc_test_data = pickle.load(open('../data/doc.test.pickle', 'rb')).tocsr()
doc_train_data = pickle.load(open('../data/doc.train.pickle', 'rb')).tocsr()
query_train_data = pickle.load(open('../data/query.train.pickle', 'rb')).tocsr()

end = time.time()
print("Loading data from HDD to memory: %.2fs" % (end - start))

TRIGRAM_D = 49284   # dimension of the word-hashed (trigram) input layer
NEG = 50            # number of negative documents per query
BS = 1024           # batch size
L1_N = 400          # first hidden layer size
L2_N = 120          # output (semantic vector) layer size

query_in_shape = np.array([BS, TRIGRAM_D], np.int64)
doc_in_shape = np.array([BS, TRIGRAM_D], np.int64)


def variable_summaries(var, name):
    """Attach a lot of summaries to a Tensor."""
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)
        tf.scalar_summary('mean/' + name, mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_sum(tf.square(var - mean)))
        tf.scalar_summary('sttdev/' + name, stddev)
        tf.scalar_summary('max/' + name, tf.reduce_max(var))
        tf.scalar_summary('min/' + name, tf.reduce_min(var))
        tf.histogram_summary(name, var)


# Input layer: sparse word-hashed vectors for the query and the clicked document.
with tf.name_scope('input'):
    query_batch = tf.sparse_placeholder(tf.float32, shape=query_in_shape, name='QueryBatch')
    doc_batch = tf.sparse_placeholder(tf.float32, shape=doc_in_shape, name='DocBatch')

# Layer 1: TRIGRAM_D -> L1_N, with weights shared between the query and document sides.
with tf.name_scope('L1'):
    l1_par_range = np.sqrt(6.0 / (TRIGRAM_D + L1_N))
    weight1 = tf.Variable(tf.random_uniform([TRIGRAM_D, L1_N], -l1_par_range, l1_par_range))
    bias1 = tf.Variable(tf.random_uniform([L1_N], -l1_par_range, l1_par_range))
    variable_summaries(weight1, 'L1_weights')
    variable_summaries(bias1, 'L1_biases')

    query_l1 = tf.sparse_tensor_dense_matmul(query_batch, weight1) + bias1
    doc_l1 = tf.sparse_tensor_dense_matmul(doc_batch, weight1) + bias1

    query_l1_out = tf.nn.relu(query_l1)
    doc_l1_out = tf.nn.relu(doc_l1)

# Layer 2: L1_N -> L2_N, producing the 120-dimensional semantic vectors.
with tf.name_scope('L2'):
    l2_par_range = np.sqrt(6.0 / (L1_N + L2_N))
    weight2 = tf.Variable(tf.random_uniform([L1_N, L2_N], -l2_par_range, l2_par_range))
    bias2 = tf.Variable(tf.random_uniform([L2_N], -l2_par_range, l2_par_range))
    variable_summaries(weight2, 'L2_weights')
    variable_summaries(bias2, 'L2_biases')

    query_l2 = tf.matmul(query_l1_out, weight2) + bias2
    doc_l2 = tf.matmul(doc_l1_out, weight2) + bias2
    query_y = tf.nn.relu(query_l2)
    doc_y = tf.nn.relu(doc_l2)

# Negative sampling: "rotate" the document batch NEG times so that every query
# is also paired with NEG mismatched documents from the same batch.
with tf.name_scope('FD_rotate'):
    temp = tf.tile(doc_y, [1, 1])
    for i in range(NEG):
        rand = int((random.random() + i) * BS / NEG)
        doc_y = tf.concat(0,
                          [doc_y,
                           tf.slice(temp, [rand, 0], [BS - rand, -1]),
                           tf.slice(temp, [0, 0], [rand, -1])])

# Cosine similarity between each query vector and its NEG + 1 candidate documents,
# scaled by a smoothing factor of 20.
with tf.name_scope('Cosine_Similarity'):
    query_norm = tf.tile(tf.sqrt(tf.reduce_sum(tf.square(query_y), 1, True)), [NEG + 1, 1])
    doc_norm = tf.sqrt(tf.reduce_sum(tf.square(doc_y), 1, True))

    prod = tf.reduce_sum(tf.mul(tf.tile(query_y, [NEG + 1, 1]), doc_y), 1, True)
    norm_prod = tf.mul(query_norm, doc_norm)

    cos_sim_raw = tf.truediv(prod, norm_prod)
    cos_sim = tf.transpose(tf.reshape(tf.transpose(cos_sim_raw), [NEG + 1, BS])) * 20

# Loss: softmax over the candidates, then the negative log-likelihood of the
# clicked (positive) document, which sits in column 0.
with tf.name_scope('Loss'):
    prob = tf.nn.softmax((cos_sim))
    hit_prob = tf.slice(prob, [0, 0], [-1, 1])
    loss = -tf.reduce_sum(tf.log(hit_prob)) / BS
    tf.scalar_summary('loss', loss)

with tf.name_scope('Training'):
    train_step = tf.train.GradientDescentOptimizer(FLAGS.learning_rate).minimize(loss)

merged = tf.merge_all_summaries()

with tf.name_scope('Test'):
    average_loss = tf.placeholder(tf.float32)
    loss_summary = tf.scalar_summary('average_loss', average_loss)


def pull_batch(query_data, doc_data, batch_idx):
    # Slice one batch out of the sparse matrices and convert it to SparseTensorValue.
    query_in = query_data[batch_idx * BS:(batch_idx + 1) * BS, :]
    doc_in = doc_data[batch_idx * BS:(batch_idx + 1) * BS, :]

    if batch_idx == 0:
        print(query_in.getrow(53))

    query_in = query_in.tocoo()
    doc_in = doc_in.tocoo()

    query_in = tf.SparseTensorValue(
        np.transpose([np.array(query_in.row, dtype=np.int64), np.array(query_in.col, dtype=np.int64)]),
        np.array(query_in.data, dtype=np.float),
        np.array(query_in.shape, dtype=np.int64))
    doc_in = tf.SparseTensorValue(
        np.transpose([np.array(doc_in.row, dtype=np.int64), np.array(doc_in.col, dtype=np.int64)]),
        np.array(doc_in.data, dtype=np.float),
        np.array(doc_in.shape, dtype=np.int64))

    return query_in, doc_in


def feed_dict(Train, batch_idx):
    """Make a TensorFlow feed_dict: maps data onto Tensor placeholders."""
    if Train:
        query_in, doc_in = pull_batch(query_train_data, doc_train_data, batch_idx)
    else:
        query_in, doc_in = pull_batch(query_test_data, doc_test_data, batch_idx)
    return {query_batch: query_in, doc_batch: doc_in}


config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    sess.run(tf.initialize_all_variables())
    train_writer = tf.train.SummaryWriter(FLAGS.summaries_dir + '/train', sess.graph)
    test_writer = tf.train.SummaryWriter(FLAGS.summaries_dir + '/test', sess.graph)

    start = time.time()
    for step in range(FLAGS.max_steps):
        batch_idx = step % FLAGS.epoch_steps

        # Debug block: prints the number of non-zero activations in query_y and
        # exits (as written, this fires at step 0 and stops the run).
        if batch_idx == 0:
            temp = sess.run(query_y, feed_dict=feed_dict(True, 0))
            print(np.count_nonzero(temp))
            sys.exit()

        # Print epoch progress.
        if batch_idx % (FLAGS.pack_size / 64) == 0:
            progress = 100.0 * batch_idx / FLAGS.epoch_steps
            sys.stdout.write("\r%.2f%% Epoch" % progress)
            sys.stdout.flush()

        # One gradient step.
        sess.run(train_step, feed_dict=feed_dict(True, batch_idx % FLAGS.pack_size))

        # At the end of each epoch, evaluate the average loss on train and test packs.
        if batch_idx == FLAGS.epoch_steps - 1:
            end = time.time()

            epoch_loss = 0
            for i in range(FLAGS.pack_size):
                loss_v = sess.run(loss, feed_dict=feed_dict(True, i))
                epoch_loss += loss_v
            epoch_loss /= FLAGS.pack_size

            train_loss = sess.run(loss_summary, feed_dict={average_loss: epoch_loss})
            train_writer.add_summary(train_loss, step + 1)

            print("\nEpoch #%-5d | Train Loss: %-4.3f | PureTrainTime: %-3.3fs" %
                  (step / FLAGS.epoch_steps, epoch_loss, end - start))

            epoch_loss = 0
            for i in range(FLAGS.pack_size):
                loss_v = sess.run(loss, feed_dict=feed_dict(False, i))
                epoch_loss += loss_v
            epoch_loss /= FLAGS.pack_size

            test_loss = sess.run(loss_summary, feed_dict={average_loss: epoch_loss})
            test_writer.add_summary(test_loss, step + 1)

            start = time.time()
            print("Epoch #%-5d | Test Loss: %-4.3f | Calc_LossTime: %-3.3fs" %
                  (step / FLAGS.epoch_steps, epoch_loss, start - end))
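The FD_rotate / Cosine_Similarity / Loss blocks are the least readable part of the graph, so here is the same computation written out in plain NumPy (my own sketch, not the repository's code; for simplicity the negatives come from a fixed rotation of the batch, whereas the TensorFlow code above uses a random offset for each of the NEG rotations):

import numpy as np

BS, DIM, NEG = 4, 8, 3       # toy sizes in place of 1024 / 120 / 50
GAMMA = 20.0                 # smoothing factor applied to the cosine scores

rng = np.random.default_rng(0)
query_y = rng.normal(size=(BS, DIM))   # query semantic vectors (output of L2)
doc_y = rng.normal(size=(BS, DIM))     # clicked-document vectors; row i matches query i

def cosine(a, b):
    # Row-wise cosine similarity between two matrices of vectors.
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# Column 0 holds the positive scores; columns 1..NEG score each query against
# documents "rotated" in from other rows of the same batch.
scores = [cosine(query_y, doc_y)]
for i in range(1, NEG + 1):
    scores.append(cosine(query_y, np.roll(doc_y, shift=i, axis=0)))
scores = np.stack(scores, axis=1) * GAMMA        # shape (BS, NEG + 1)

# Softmax over the NEG + 1 candidates of each query; the loss is the average
# negative log-probability assigned to the clicked document (column 0).
exp = np.exp(scores - scores.max(axis=1, keepdims=True))
prob = exp / exp.sum(axis=1, keepdims=True)
loss = -np.log(prob[:, 0]).mean()
print("loss:", loss)

Maximizing the probability of the clicked document against in-batch negatives is exactly the training criterion described in the paper.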
On the relationship between preprocessing and results
- The paper uses mean Normalized Discounted Cumulative Gain (NDCG) as the metric for comparing the models (a small NDCG sketch follows this list).
- The conclusion is that letter word hashing + DNN clearly performs best.
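For reference, NDCG rewards relevant results appearing near the top of the ranking and normalizes by the best possible ordering; here is a minimal NDCG@k sketch (my own illustration, not taken from the paper or the repository):

import numpy as np

def ndcg_at_k(relevances, k):
    # 'relevances' are graded relevance labels in the order the model ranked them.
    rel = np.asarray(relevances, dtype=float)[:k]
    dcg = np.sum((2 ** rel - 1) / np.log2(np.arange(2, rel.size + 2)))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = np.sum((2 ** ideal - 1) / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0

# A ranking whose labels are (3, 2, 3, 0, 1, 2): the ideal top-3 would be (3, 3, 2).
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=3))

The paper reports the mean NDCG over evaluation queries at truncation levels 1, 3, and 10.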