
python word2vector (Part 3)

Published: 2025/4/5


Link: The Three-Body Problem (三體)

Download the Three-Body Problem text file, rename it to santi.txt, and place it in the same directory as the program.
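Before training, it can help to confirm that the file really is where the script expects it and to find out which encoding it uses, since downloaded Chinese text is often GB18030 rather than UTF-8. The helpers below are a minimal sketch of that check. The names `find_corpus` and `guess_encoding` are made up for illustration and are not part of the original script, and trying only utf-8 and gb18030 is an assumption based on the two encodings the script mentions.

```python
from pathlib import Path


def find_corpus(filename="santi.txt", base_dir=None):
    """Return the path of the corpus file inside base_dir (default: current directory).

    Hypothetical helper: raises FileNotFoundError if the file is missing.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = base / filename
    if not path.is_file():
        raise FileNotFoundError("expected corpus file at {0}".format(path))
    return path


def guess_encoding(path, candidates=("utf-8", "gb18030")):
    """Return the first candidate encoding that decodes the whole file without errors.

    Hypothetical helper: the candidate list is an assumption, not an exhaustive detector.
    """
    data = Path(path).read_bytes()
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    raise ValueError("none of {0} can decode {1}".format(candidates, path))
```

A possible use is `coding = guess_encoding(find_corpus('santi.txt'))`, with the result passed on as the file encoding for the segmentation step.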

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    """
    Created on Wed Aug 1 10:13:28 2018

    @author: luogan
    """
    import multiprocessing
    import re

    import gensim
    import jieba
    from gensim.models import word2vec


    def segment_text(source_corpus, train_corpus, coding, punctuation):
        '''
        Segment the text into words and strip punctuation.

        :param source_corpus: path of the raw corpus
        :param train_corpus: path of the segmented corpus to write
        :param coding: file encoding
        :param punctuation: punctuation characters to remove
        :return: None
        '''
        # re.escape keeps characters such as '-', ']' and '\' literal inside the character class
        pattern = re.compile('[{0}]+'.format(re.escape(punctuation)))
        with open(source_corpus, 'r', encoding=coding) as f, \
                open(train_corpus, 'w', encoding=coding) as w:
            for line in f:
                # Strip punctuation
                line = pattern.sub('', line.strip())
                # Segment into words
                words = jieba.cut(line)
                # End each line with a newline so that the last word of one line
                # does not fuse with the first word of the next
                w.write(' '.join(words) + '\n')


    # Strict punctuation set
    strict_punctuation = '。，、＇：∶；?‘’“”〝〞?ˇ﹕︰﹔﹖﹑·¨….?;！′？！～—ˉ｜‖＂〃｀@﹫??﹏﹋﹌︴々﹟#﹩$﹠&﹪%*﹡﹢﹦﹤‐￣ˉ―﹨??﹍﹎+=<--＿_-\\ˇ~﹉﹊（）〈〉??﹛﹜『』〖〗［］《》〔〕{}「」【】︵︷︿︹︽_﹁﹃︻︶︸﹀︺︾ˉ﹂﹄︼'
    # Simple punctuation set
    simple_punctuation = '’!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    # Punctuation to remove
    punctuation = simple_punctuation + strict_punctuation

    # File encoding
    coding = 'utf-8'
    # coding = 'gb18030'
    # Raw corpus
    source_corpus_text = 'santi.txt'

    # Dimensionality of each word vector
    size = 10
    # Context window used during training; a window of 5 looks at the
    # 5 words before and the 5 words after the current word
    window = 5
    # Minimum frequency. gensim's default is 5, which drops every word that
    # occurs fewer than 5 times; 1 keeps every word
    min_count = 1
    # Number of worker threads. gensim's default is 3; here it is set to the
    # number of CPU cores on the current machine
    workers = multiprocessing.cpu_count()
    # Segmented corpus
    train_corpus_text = 'words.txt'
    # w2v model file
    model_text = 'w2v_size_{0}.model'.format(size)

    # Segment the corpus. @TODO comment out once the segmented file exists
    segment_text(source_corpus_text, train_corpus_text, coding, punctuation)

    # Train the w2v model. @TODO comment out once the model is trained
    sentences = word2vec.Text8Corpus(train_corpus_text)
    # `size` is the keyword used by gensim < 4.0; gensim 4.0+ names it `vector_size`
    model = word2vec.Word2Vec(sentences=sentences, size=size, window=window,
                              min_count=min_count, workers=workers)
    model.save(model_text)

    # Load the model
    model = gensim.models.Word2Vec.load(model_text)
    # print(model.wv['運動會'])

    # Query words must be written in the same script (simplified or traditional)
    # as the corpus, otherwise the lookup raises KeyError.

    # Words most similar to a given word, sorted by descending similarity
    similar_words = model.wv.most_similar('文明')
    for word in similar_words:
        print(word[0], word[1])

    # Cosine similarity between two words
    sim1 = model.wv.similarity('飛船', '愛情')
    print(sim1)

    # Cosine similarity between two sets of words
    list1 = ['三體', '物理']
    list2 = ['相對論', '量子']
    list_sim1 = model.wv.n_similarity(list1, list2)
    print(list_sim1)

    # Pick the word that does not belong with the others
    candidates = ['上帝', '葡萄', '基督', '愛']
    print(model.wv.doesnt_match(candidates))

Output:

    人類 0.9959995150566101
    信息化 0.9937461614608765
    宇宙 0.9928457736968994
    數字化 0.9915485978126526
    第三方 0.9911766052246094
    世界 0.9876834154129028
    三體 0.9861425161361694
    蹤跡 0.985514760017395
    達爾文 0.9851547479629517
    退回 0.9846546649932861
    0.8862147950518867
    0.9169202194786192
    葡萄

Converting a word to its vector:

    vector = model.wv['三體']  # numpy vector of a word
    print(vector)

Output:

    [-6.710487 1.0552257 -3.034105 -4.0415897 4.609099 2.3800325 -0.9013142 2.425477 0.3875136 2.4119503]
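To make the three similarity queries less of a black box, here is a rough sketch in plain Python of what they compute. It is not gensim's code. It assumes, as far as I understand the library, that the set similarity is the cosine between the mean vectors of the two sets and that the odd word out is the one least similar to the mean of the normalized vectors. The function names and the 2-dimensional toy vectors are invented for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def set_similarity(vectors1, vectors2):
    """Cosine between the mean vectors of two sets (assumed behaviour of n_similarity)."""
    return cosine(mean_vector(vectors1), mean_vector(vectors2))


def odd_one_out(named_vectors):
    """Return the word whose vector is least similar to the mean of all unit vectors
    (assumed behaviour of doesnt_match)."""
    units = {word: unit(vec) for word, vec in named_vectors.items()}
    center = mean_vector(list(units.values()))
    return min(units, key=lambda word: cosine(units[word], center))


# Hypothetical toy vectors, not taken from the trained model
toy_vectors = {
    "god": [1.0, 0.1],
    "christ": [0.9, 0.2],
    "love": [0.8, 0.3],
    "grape": [0.1, 1.0],
}
```

With these toy vectors, `odd_one_out(toy_vectors)` picks "grape", because its direction differs most from the other three. In the real model the same idea runs over the 10-dimensional vectors stored in `model.wv`.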
