NLP word2vec: Training Simplified Chinese Word Vectors on the Wikipedia Text (Chinese Wikipedia) Corpus with the Word2vec Tool
Contents
Output
Design approach
1. Wikipedia Text corpus source
2. Parsing the Wikipedia dump
3. Converting Traditional to Simplified Chinese
4. Converting non-UTF-8 characters to UTF-8
5. Running word2vec
Implementation code
Output
To be updated…
Final model:
word2vec_wiki.model.rar
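A minimal sketch of loading this saved model with gensim and querying its nearest neighbours (assuming the archive unpacks to word2vec_wiki.model; the query word is an arbitrary example):

# load the trained model and look up similar words
from gensim import models

model = models.Word2Vec.load('./word2vec_wiki.model')  # path assumed from the archive name
for word, sim in model.most_similar(u'汽车', topn=10):  # u'汽车' ("car") is an arbitrary query
    print(word, sim)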
Design approach
To be updated…
1. Wikipedia Text corpus source
Corpus source and download: zhwiki dump progress on 20190120
The file zhwiki-latest-pages-articles.xml.bz2 contains article titles and body text. The compressed archive is about 1.3 GB and decompresses to about 5.7 GB, which is still much smaller than the English Wikipedia dump.
2. Parsing the Wikipedia dump
The downloaded dump is in XML format, so the article text has to be extracted from it. There are several mature tools for parsing Wikipedia dumps (e.g., gensim and Wikipedia Extractor); the process.py script in the Implementation section below uses gensim's WikiCorpus. Wikipedia Extractor is a simple, convenient Python script.
T1. The Wikipedia Extractor tool
Wikipedia Extractor homepage: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
Usage: after downloading WikiExtractor.py, run it with a command of the form shown below, where -cb 1200M splits the output into 1200 MB compressed chunks, -o specifies the output directory, and the last argument is the input file.
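The exact invocation did not survive in the post; a command matching the parameter description above would presumably look like this (the output directory name "extracted" is an assumption):

python WikiExtractor.py -cb 1200M -o extracted zhwiki-latest-pages-articles.xml.bz2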
T2. Python implementation
Convert the compressed XML file to a txt file:
python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
3. Converting Traditional to Simplified Chinese
Most Chinese Wikipedia content is written in Traditional Chinese, so it needs to be converted to Simplified. You can use the Traditional-to-Simplified conversion tool developed by the Xiamen University NLP Lab, or OpenCC.
T1. The Xiamen University NLP Lab conversion tool
Download: http://jf.cloudtranslation.cc/
Usage: download the standalone version and run it from a Windows command prompt, passing file1.txt (the Traditional-Chinese source file), file2.txt (the target file for the converted output), and lm_s2t.txt (the language-model file) as arguments.
T2. OpenCC
opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini
This converts the Traditional characters to Simplified. (zht2zhs.ini is the configuration shipped with older OpenCC releases; newer releases use JSON configurations such as t2s.json.)
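If you would rather stay in Python, the opencc-python-reimplemented package exposes the same conversion; a minimal sketch, assuming that package and its built-in 't2s' configuration:

# pip install opencc-python-reimplemented  (assumed package)
from opencc import OpenCC

cc = OpenCC('t2s')  # Traditional -> Simplified
with open('wiki.zh.text', encoding='utf-8') as fin, \
        open('wiki.zh.text.jian', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(cc.convert(line))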
4. Converting non-UTF-8 characters to UTF-8
iconv -c -t UTF-8 < wiki.zh.text.jian.seg > wiki.zh.text.jian.seg.utf-8
(The -c flag silently drops any characters that cannot be converted.)
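Note the .seg suffix: between steps 3 and 4 the simplified text is word-segmented. That step is not shown in the post; since the scripts below use jieba, a minimal sketch of it might look like this (the filenames follow the pipeline above; everything else is an assumption):

# seg_wiki.py -- assumed segmentation step producing wiki.zh.text.jian.seg
import codecs
import jieba

with codecs.open('wiki.zh.text.jian', 'r', encoding='utf-8', errors='ignore') as fin, \
        codecs.open('wiki.zh.text.jian.seg', 'w', encoding='utf-8') as fout:
    for line in fin:
        # cut each article into words and re-join with spaces, as word2vec expects
        fout.write(' '.join(jieba.cut(line.strip())) + '\n')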
5. Running word2vec
python train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
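train_word2vec_model.py itself is not reproduced in this post. A minimal sketch consistent with the command above, using gensim's LineSentence over the segmented UTF-8 text (the hyperparameter values are placeholders, not necessarily the author's):

# train_word2vec_model.py -- minimal sketch (assumed, not the author's original)
import multiprocessing
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    # arguments: segmented corpus, output model file, output text vectors
    inp, outp_model, outp_vec = sys.argv[1:4]
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())  # gensim < 4.0: size=; newer: vector_size=
    model.save(outp_model)
    model.wv.save_word2vec_format(outp_vec, binary=False)  # plain-text vectors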
Implementation code
Being updated…
Notes on the code files below:

# We build the word2vec model from wiki text such as
# https://dumps.wikimedia.org/zhwiki/20161001/zhwiki-20161001-pages-articles-multistream.xml.bz2
#
# parameters:
# =================================
# feature_size = 500
# content_window = 5
# freq_min_count = 3
# # threads_num = 4
# negative = 3  # best: hierarchical softmax (better for rare words) rather than negative
#               # sampling (better for frequent words); note that negative=3 as set here
#               # actually enables negative sampling in gensim
# iter = 20
#
# process.py: converts the wiki*.xml dump to text
# word2vec_wiki.py: creates and loads the model
1. The process.py file

# process.py
# Convert the Wikipedia XML dump into plain text, one article per line.
# run: python process_wiki.py ../data/zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print("Usage: python process_wiki.py <wiki-dump.xml.bz2> <output.txt>")
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    space = " "
    i = 0
    output = open(outp, 'w')
    # lemmatize=False keeps the raw tokens (gensim < 4.0 API)
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if i % 10000 == 0:
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished: saved " + str(i) + " articles")

2. The word2vec_wiki.py file

# word2vec_wiki.py
# -*- coding:utf-8 -*-
# Python 2 script: build (or load) a word2vec model from the wiki text.
from __future__ import print_function
import codecs
import json
import logging
import multiprocessing
import os
import sys
import time

import jieba.posseg as pseg
from gensim import models, corpora

sys.path.append("../../")
sys.path.append("../../langconv/")
sys.path.append("../../parser/")

logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
logging.root.setLevel(level=logging.INFO)


def json_dict_from_file(json_file, fieldnames=None, isdelwords=True):
    """Load a JSON-lines file and return a list of records,
    optionally restricted to the given fieldnames."""
    obj_s = []
    with open(json_file) as f:
        for line in f:
            object_dict = json.loads(line)
            if fieldnames is None:
                obj_s.append(object_dict)
            elif set(fieldnames).issubset(set(object_dict.keys())):
                one = []
                for fieldname in fieldnames:
                    if isdelwords and fieldname == 'content':
                        one.append(delNOTNeedWords(object_dict[fieldname])[1])
                    else:
                        one.append(object_dict[fieldname])
                obj_s.append(one)
    return obj_s


def delNOTNeedWords(content, customstopwords=None):
    """Tokenize with jieba POS tagging, dropping stop words and
    unwanted parts of speech (keeps nouns, verbs, adjectives, etc.)."""
    if customstopwords is None:
        customstopwords = "stopwords.txt"
    if os.path.exists(customstopwords):
        stop_words = codecs.open(customstopwords, encoding='UTF-8').read().split(u'\n')
        customstopwords = stop_words

    result = ''
    return_words = []
    words = pseg.lcut(content)
    for word, flag in words:
        tempword = word.encode('utf-8').strip(' ')  # Python 2: unicode -> utf-8 bytes
        if (word not in customstopwords and len(tempword) > 0 and flag in
                [u'n', u'nr', u'ns', u'nt', u'nz', u'ng', u't', u'tg', u'f',
                 u'v', u'vd', u'vn', u'vf', u'vx', u'vi', u'vl', u'vg',
                 u'a', u'an', u'ag', u'al', u'm', u'mq', u'o', u'x']):
            result += tempword
            return_words.append(tempword)
    return result, return_words


def get_save_wikitext(wiki_filename, text_filename):
    """Extract plain text from the wiki dump, one article per line."""
    output = open(text_filename, 'w')
    # fixed: the original passed text_filename here and never initialized i
    wiki = corpora.WikiCorpus(wiki_filename, lemmatize=False, dictionary={})
    i = 0
    for text in wiki.get_texts():
        output.write(" ".join(text) + "\n")
        i = i + 1
        if i % 10000 == 0:
            logging.info("Saved " + str(i) + " articles")
    output.close()


def load_save_word2vec_model(line_words, model_filename):
    # model parameters
    feature_size = 500
    content_window = 5
    freq_min_count = 3
    # threads_num = 4
    negative = 3  # enables negative sampling; for hierarchical softmax use hs=1, negative=0
    iter = 20

    print("word2vec...")
    tic = time.time()
    if os.path.isfile(model_filename):
        model = models.Word2Vec.load(model_filename)
        print(model.vocab)
        print("Loaded word2vec model")
    else:
        # learn common bigram phrases before training (gensim < 4.0 API: size=, iter=)
        bigram_transformer = models.Phrases(line_words)
        model = models.Word2Vec(bigram_transformer[line_words], size=feature_size,
                                window=content_window, iter=iter,
                                min_count=freq_min_count, negative=negative,
                                workers=multiprocessing.cpu_count())
        toc = time.time()
        print("Word2vec completed! Elapsed time is %s." % (toc - tic))
        model.save(model_filename)
        print("Word2vec Saved!")
    return model


if __name__ == '__main__':
    limit = -1  # how many wiki articles to take; -1 means all
    wiki_filename = '/home/wac/data/zhwiki-20160203-pages-articles-multistream.xml'
    wiki_text = './wiki_text.txt'
    wikimodel_filename = './word2vec_wiki.model'
    s_list = []

    # to (re)build the text corpus, uncomment these lines:
    # get_save_wikitext(wiki_filename, wiki_text)
    # for i, text in enumerate(open(wiki_text, 'r')):
    #     s_list.append(delNOTNeedWords(text, "../../stopwords.txt")[1])
    #     print(i)
    #     if i == limit:  # take the first `limit` articles; -1 means all
    #         break

    # train (or load) the model
    model = load_save_word2vec_model(s_list, wikimodel_filename)

    # query similar words interactively
    while 1:
        print("Enter a word to test: ", end='\b')
        t_word = sys.stdin.readline()
        if "quit" in t_word:
            break
        try:
            results = model.most_similar(
                t_word.decode('utf-8').strip('\n').strip('\r').strip(' ').split(' '),
                topn=30)
        except:
            continue
        for t_w, t_sim in results:
            print(t_w, " ", t_sim)
References (with thanks to the original sources)
Training Simplified Chinese word vectors with Wikipedia
Obtaining the Chinese Wiki corpus
Processing the Wiki corpus
Training with the Chinese Wikipedia corpus
Training word2vec on the Wikipedia corpus under Python 3.5 on Windows to find synonym similarities