Training a Synonym Model with Word2Vec
1. Requirements
The business goal is to identify synonyms and related terms for a set of target words, mainly for medical consultation scenarios. Some example target words:
尿 (urine)
痘痘 (pimples)
发冷 (chills)
呼吸困难 (difficulty breathing)
恶心 (nausea)
The data source is a batch of IM (instant messaging) records, so we choose Google's word2vec model to learn synonyms and related words.
2. Data Processing
Data preparation covers the following points (a small preprocessing sketch follows the list):
1. Export data sets of different sizes from Hive
2. Filter out useless training samples (e.g. lines with fewer than 5 characters)
3. Prepare a custom user dictionary
4. Prepare a stopword list
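A minimal sketch of points 2–4, under assumed file names: new_dict.txt and stopword_zh.txt mirror the files used in the full code in section 7, while raw_corpus.txt and separated_corpus.txt are hypothetical.

```python
import codecs
import jieba

# Hypothetical input/output names; new_dict.txt and stopword_zh.txt match the full code in section 7.
jieba.load_userdict("new_dict.txt")                                   # custom vocabulary, one word per line
stopwords = {w.strip() for w in codecs.open("stopword_zh.txt", "r", "utf-8")}

with codecs.open("raw_corpus.txt", "r", "utf-8", errors="ignore") as reader, \
     codecs.open("separated_corpus.txt", "w", "utf-8") as writer:
    for line in reader:
        line = line.strip()
        if len(line) < 5:                                             # drop samples shorter than 5 characters
            continue
        words = [w for w in jieba.cut(line) if w.strip() and w not in stopwords]
        if words:
            writer.write(" ".join(words) + "\n")
```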
3. Tool Selection
We use Python's gensim library. Since this is a preliminary study and the data volume is not large, a single machine is enough, so Spark-based training is not considered for now; it is planned for the later production environment.
The gensim word2vec documentation already describes the tool's usage in detail, so it is not repeated here.
Word segmentation is done with jieba.
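For reference, jieba offers a full mode (cut_all=True, which is what the training step below uses: it emits every word it can recognize, including overlapping segments) and a default precise mode. A tiny illustration on a hypothetical sentence:

```python
import jieba

sentence = "最近总是呼吸困难还有点恶心"  # hypothetical patient complaint

# Full mode: emits every recognizable word, including overlapping segments.
print("/".join(jieba.cut(sentence, cut_all=True)))

# Precise mode (default): a single non-overlapping segmentation.
print("/".join(jieba.cut(sentence)))
```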
4. Training Steps in Brief
1. Segment each line and remove stopwords:
```python
seg_word_line = jieba.cut(line, cut_all=True)
```
2. Feed the segmented corpus to the model:
```python
model = gensim.models.Word2Vec(LineSentence(source_separated_words_file),
                               size=200, window=5, min_count=5, alpha=0.02, workers=4)
```
3. Save the model for later use, then query it for a target word's synonyms:
```python
model.save(model_path)
similary_words = model.most_similar(w, topn=10)
```
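The snippets above (and the full code in section 7) use the older, pre-4.0 gensim API. A rough equivalent for gensim >= 4.0, where `size` was renamed to `vector_size` and similarity queries moved to the `wv` keyed-vectors object, assuming the same file names as in section 7:

```python
import gensim
from gensim.models.word2vec import LineSentence

source_separated_words_file = "dk_spe_file_20170216.txt"  # segmented corpus from step 1
model_path = "information_model0830"                      # where to store the trained model

# Same hyperparameters as above, with the gensim >= 4.0 parameter name.
model = gensim.models.Word2Vec(LineSentence(source_separated_words_file),
                               vector_size=200, window=5, min_count=5,
                               alpha=0.02, workers=4)
model.save(model_path)

# Reload later and query through the word-vectors object.
model = gensim.models.Word2Vec.load(model_path)
print(model.wv.most_similar("头痛", topn=10))
```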
5. Key Tuning Parameters
The more important parameters are:
1. Training data size: at first only 100,000 records were used and the resulting model was quite poor; the corpus was then gradually increased to 8 million records, which clearly improved the results.
2. Vector dimensionality: the number of dimensions per word vector. It affects the amount of computation; in theory a somewhat larger dimensionality works better.
3. Learning rate
4. Window size
Not much effort went into tuning, because the results already looked acceptable on inspection; a more careful sweep will be done before the model goes into production. A minimal sketch of such a sweep is shown below.
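A hedged sweep sketch, assuming a segmented corpus file and a few hypothetical probe words, using the same pre-4.0 gensim parameter names as the code above; it simply trains one model per setting and prints the top neighbours for eyeballing:

```python
import gensim
from gensim.models.word2vec import LineSentence

corpus_file = "dk_spe_file_20170216.txt"      # segmented corpus, one sentence per line
probe_words = ["头痛", "恶心", "发冷"]          # hypothetical probe words to inspect

for size in (100, 200, 300):                  # vector dimensionality
    for window in (5, 8):                     # context window size
        model = gensim.models.Word2Vec(LineSentence(corpus_file),
                                       size=size, window=window,
                                       min_count=5, alpha=0.02, workers=4)
        for w in probe_words:
            try:
                neighbours = [t for t, _ in model.most_similar(w, topn=5)]
            except KeyError:                  # probe word not in the vocabulary
                neighbours = "not in vocabulary"
            print(size, window, w, neighbours)
```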
6. Actual Model Results

| Target word | Top-10 most similar words |
| --- | --- |
| 尿 (urine) | 尿液,撒尿,尿急,尿尿有,尿到,内裤,尿意,小解,前列腺炎,小便 |
| 痘痘 (pimples) | 逗逗,豆豆,痘子,小痘,青春痘,红痘,长痘痘,粉刺,讽刺,白头 |
| 发冷 (chills) | 发烫,没力,忽冷忽热,时冷时热,小柴胡,头昏,嗜睡,38.9,头晕,发寒 |
| 呼吸困难 (difficulty breathing) | 气来,气紧,窒息,大气,透不过气,出不上,濒死,粗气,压气,心律不齐 |
| 恶心 (nausea) | 闷,力气,呕心,胀气,涨,不好受,不进,晕车,闷闷,精神 |
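The rows above can be regenerated (up to training randomness) with a short loop over the target words against the trained model from section 7; a hypothetical sketch using the same pre-4.0 API:

```python
import gensim

model = gensim.models.Word2Vec.load("information_model0830")  # model file from section 7

target_words = ["尿", "痘痘", "发冷", "呼吸困难", "恶心"]
for w in target_words:
    top10 = [word for word, _ in model.most_similar(w, topn=10)]  # pre-4.0 API, as in the rest of the post
    print("| %s | %s |" % (w, ",".join(top10)))
```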
7. Runnable Code

```python
import codecs
import jieba
import gensim
from gensim.models.word2vec import LineSentence


def read_source_file(source_file_name):
    try:
        file_reader = codecs.open(source_file_name, 'r', 'utf-8', errors="ignore")
        lines = file_reader.readlines()
        print("Read complete!")
        file_reader.close()
        return lines
    except:
        print("There are some errors while reading.")


def write_file(target_file_name, content):
    file_write = codecs.open(target_file_name, 'w+', 'utf-8')
    file_write.writelines(content)
    print("Write successfully!")
    file_write.close()


def separate_word(filename, user_dic_file, separated_file):
    print("separate_word")
    lines = read_source_file(filename)
    # jieba.load_userdict(user_dic_file)
    stopkey = [line.strip() for line in codecs.open('stopword_zh.txt', 'r', 'utf-8').readlines()]
    output = codecs.open(separated_file, 'w', 'utf-8')
    num = 0
    for line in lines:
        num = num + 1
        if num % 10000 == 0:
            print("Processing line number: " + str(num))
        seg_word_line = jieba.cut(line, cut_all=True)
        # note: set() deduplicates the words within a line and discards their order
        wordls = list(set(seg_word_line) - set(stopkey))
        if len(wordls) > 0:
            word_line = ' '.join(wordls) + '\n'
            output.write(word_line)
    output.close()
    return separated_file


def build_model(source_separated_words_file, model_path):
    print("start building...", source_separated_words_file)
    model = gensim.models.Word2Vec(LineSentence(source_separated_words_file),
                                   size=200, window=5, min_count=5, alpha=0.02, workers=4)
    model.save(model_path)
    print("build successful!", model_path)
    return model


def get_similar_words_str(w, model, topn=10):
    result_words = get_similar_words_list(w, model, topn)
    return str(result_words)


def get_similar_words_list(w, model, topn=10):
    result_words = []
    try:
        similary_words = model.most_similar(w, topn=topn)
        print(similary_words)
        for (word, similarity) in similary_words:
            result_words.append(word)
        print(result_words)
    except:
        print("There are some errors! " + w)
    return result_words


def load_models(model_path):
    return gensim.models.Word2Vec.load(model_path)


if __name__ == "__main__":
    filename = "d:\\data\\dk_mainsuit_800w.txt"            # source file
    user_dic_file = "new_dict.txt"                         # user dictionary file
    separated_file = "d:\\data\\dk_spe_file_20170216.txt"  # separated words file
    model_path = "information_model0830"                   # model file

    # source_separated_words_file = separate_word(filename, user_dic_file, separated_file)
    source_separated_words_file = separated_file           # if the separated words file exists, skip separate_word
    build_model(source_separated_words_file, model_path)   # if the model file exists, skip build_model
    model = load_models(model_path)
    words = get_similar_words_str('头痛', model)
    print(words)
```