Generating Similar Sentences in Python: 4 Ways to Compute Sentence Similarity

Published: 2023/12/13


Edit Distance

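Edit distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into another. The more edits required, the larger the distance and the less related the two strings are. For example, turning "xiaoming" into "xiamin" takes two steps:

Remove 'o'

Remove 'g'

So the number of edits, and therefore the distance, is 2.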

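# Notebook cell syntax; in a shell, run: pip install distance
!pip install distance

import distance

def edit_distance(s1, s2):
    # Levenshtein distance between the two strings
    return distance.levenshtein(s1, s2)

s1 = 'xiaoming'
s2 = 'xiamin'
print('Distance: ' + str(edit_distance(s1, s2)))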
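
To make the definition concrete, here is a minimal dynamic-programming sketch of Levenshtein distance in plain Python. It is not the implementation used by the `distance` package; the function name `levenshtein` and the row-by-row layout are choices made for this illustration only.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s1 into s2."""
    # prev[j] is the distance between the previous prefix of s1 and s2[:j].
    # For the empty prefix of s1, reaching s2[:j] takes j insertions.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into '' takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # delete c1
                curr[j - 1] + 1,     # insert c2
                prev[j - 1] + cost,  # keep (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]

print(levenshtein('xiaoming', 'xiamin'))  # 2, matching the example above
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.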

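Jaccard Coefficient

The Jaccard coefficient compares the similarity and difference between finite sample sets. The larger the coefficient, the more similar the samples. It is computed as the size of the intersection of the two samples divided by the size of their union.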

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

def jaccard_similarity(s1, s2):
    def add_space(s):
        return ' '.join(list(s))

    # Insert a space between characters so that each character becomes a token
    s1, s2 = add_space(s1), add_space(s2)
    # Convert to a term-frequency (TF) matrix
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Intersection: element-wise minimum of the two count vectors
    numerator = np.sum(np.min(vectors, axis=0))
    # Union: element-wise maximum of the two count vectors
    denominator = np.sum(np.max(vectors, axis=0))
    # Jaccard coefficient
    return 1.0 * numerator / denominator

s1 = '你在干啥呢'
s2 = '你在干什么呢'
print(jaccard_similarity(s1, s2))
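
The "intersection divided by union" definition can also be written directly with Python sets. The sketch below is an illustration, not the post's method: `jaccard_set_similarity` is a name made up here, and unlike the `CountVectorizer` version it ignores how often a character repeats. For strings without repeated characters, such as the two above, both approaches should agree.

```python
def jaccard_set_similarity(s1, s2):
    """Jaccard coefficient over the sets of characters in s1 and s2."""
    a, b = set(s1), set(s2)
    union = a | b
    if not union:
        # Assumed convention for this sketch: two empty strings count as identical
        return 1.0
    return len(a & b) / len(union)

# 4 shared characters out of 7 distinct characters in total
print(jaccard_set_similarity('你在干啥呢', '你在干什么呢'))
```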

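TF Computation

This method computes the similarity of the two term-frequency vectors in the matrix, that is, the cosine of the angle between them.

Formula: cosθ = (a·b) / (|a|·|b|)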

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from scipy.linalg import norm

def tf_similarity(s1, s2):
    def add_space(s):
        return ' '.join(list(s))

    # Insert a space between characters so that each character becomes a token
    s1, s2 = add_space(s1), add_space(s2)
    # Convert to a term-frequency (TF) matrix
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # Cosine similarity of the two TF vectors
    return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))

s1 = '你在干啥呢'
s2 = '你在干什么呢'
print(tf_similarity(s1, s2))
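
To see what the cosine formula does with these two sentences, the same quantity can be worked out with only the standard library. This is a sketch for illustration; `cosine_char_similarity` is a hypothetical name, and returning 0.0 for empty input is an assumption of this example rather than something the post specifies.

```python
import math
from collections import Counter

def cosine_char_similarity(s1, s2):
    """Cosine similarity between the character-count vectors of s1 and s2."""
    c1, c2 = Counter(s1), Counter(s2)
    # Only characters present in both strings contribute to the dot product
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for empty input
    return dot / (norm1 * norm2)

# The strings share 4 characters and have 5 and 6 distinct characters,
# so the value is 4 / sqrt(5 * 6)
print(cosine_char_similarity('你在干啥呢', '你在干什么呢'))
```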

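Advanced Model: BERT

For BERT's internal structure, see the article "From word2vec to BERT"; this post covers only the code. You can either download the BERT source code or load the model through TF-Hub. Here we download the source.

First, download the source code from GitHub, then download Google's pre-trained model. We choose BERT-Base, Chinese.

After downloading, unpack the pre-trained model. Its file structure is shown in the figure:

[Figure: file structure of the unpacked pre-trained model]

vocab.txt is the vocabulary used for Chinese text during training, and bert_config.json holds the parameters that can optionally be adjusted when training BERT. The remaining files store the model structure and its parameters.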
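
A quick way to check the unpacked files is to read them from Python. The sketch below is an illustration under a few assumptions: `describe_bert_dir` is a hypothetical helper, vocab.txt is assumed to hold one token per line, and the config keys it reads (`vocab_size`, `hidden_size`, `num_hidden_layers`, `num_attention_heads`) are the ones we expect in bert_config.json; `.get` is used so that a missing key yields None instead of an error.

```python
import json
import os

def describe_bert_dir(model_dir):
    """Return a few facts read from vocab.txt and bert_config.json."""
    with open(os.path.join(model_dir, "bert_config.json"), encoding="utf-8") as f:
        config = json.load(f)
    with open(os.path.join(model_dir, "vocab.txt"), encoding="utf-8") as f:
        # Assumes one token per line
        vocab_lines = sum(1 for _ in f)
    return {
        "vocab_lines": vocab_lines,
        "vocab_size": config.get("vocab_size"),
        "hidden_size": config.get("hidden_size"),
        "num_hidden_layers": config.get("num_hidden_layers"),
        "num_attention_heads": config.get("num_attention_heads"),
    }

# Example call, using the model path from the training command in this post:
# print(describe_bert_dir("/Users/xiaomingtai/Downloads/chinese_L-12_H-768_A-12"))
```

If the number of lines in vocab.txt and the `vocab_size` in the config disagree, the download or unpacking step is the first thing to recheck.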

Prepare the Dataset
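
Judging from the MoveProcessor code in the next section, the data directory holds three tab-separated files: train.tsv and dev.tsv with the label in the first column and the text in the second, and test.tsv with the text only. The sketch below writes files in that layout. The helper names `write_examples` and `write_dataset` are made up for this illustration, and the layout itself is inferred from the processor, not documented by the post.

```python
import os

def write_examples(path, rows):
    """Write one example per line, with fields separated by tabs."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for fields in rows:
            for field in fields:
                # The reader splits on tabs, so a tab or newline inside
                # a field would corrupt the row
                if "\t" in field or "\n" in field:
                    raise ValueError("field contains tab or newline: %r" % field)
            f.write("\t".join(fields) + "\n")

def write_dataset(data_dir, train, dev, test):
    """train and dev are (label, text) pairs; test is a list of texts."""
    os.makedirs(data_dir, exist_ok=True)
    write_examples(os.path.join(data_dir, "train.tsv"), train)
    write_examples(os.path.join(data_dir, "dev.tsv"), dev)
    write_examples(os.path.join(data_dir, "test.tsv"), [(text,) for text in test])

# Example call with placeholder arguments (labels are the strings "0" or "1"):
# write_dataset("bert_model", train_pairs, dev_pairs, test_texts)
```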

Modify the Processor

# Add this class to run_classifier.py, which already imports csv, os,
# tokenization, and tensorflow (as tf), and defines DataProcessor and InputExample.
class MoveProcessor(DataProcessor):
    """Processor for the move data set."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    @classmethod
    def _read_tsv(cls, input_file, quotechar=None):
        """Reads a tab separated value file."""
        with tf.gfile.Open(input_file, "r") as f:
            reader = csv.reader(f, delimiter="\t", quotechar=quotechar)
            lines = []
            for line in reader:
                lines.append(line)
            return lines

    def _create_examples(self, lines, set_type):
        """Creates examples for the training, dev, and test sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            if set_type == "test":
                # Test rows contain only the text; the label is a placeholder
                text_a = tokenization.convert_to_unicode(line[0])
                label = "0"
            else:
                # Train/dev rows: label in column 0, text in column 1
                text_a = tokenization.convert_to_unicode(line[1])
                label = tokenization.convert_to_unicode(line[0])
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

Modify the Processor Dictionary

def main(_):
    tf.logging.set_verbosity(tf.logging.INFO)

    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mrpc": MrpcProcessor,
        "xnli": XnliProcessor,
        "setest": MoveProcessor,  # new entry; the key is the value passed to --task_name
    }
    # ... the rest of main() is unchanged
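
The dictionary works as a small registry: the task name given on the command line selects which processor class is instantiated. The sketch below imitates that lookup with stub classes so it runs without the BERT source. `get_processor` is a hypothetical helper written for this example, and the lowercasing of the task name is our recollection of how run_classifier.py behaves, so check your copy of the script before relying on it.

```python
class DataProcessor:
    """Stand-in for the base class in run_classifier.py (stub for illustration)."""

class ColaProcessor(DataProcessor):
    pass

class MoveProcessor(DataProcessor):
    pass

processors = {
    "cola": ColaProcessor,
    "setest": MoveProcessor,
}

def get_processor(task_name):
    """Hypothetical helper: look up and instantiate the processor for a task."""
    key = task_name.lower()  # treat SETEST and setest as the same task
    if key not in processors:
        raise ValueError("Task not found: %s" % key)
    return processors[key]()

print(type(get_processor("setest")).__name__)  # MoveProcessor
```

A task name that is missing from the dictionary fails immediately with a clear error, which is why the new entry has to be added before training.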

Train the BERT Model

export BERT_BASE_DIR=/Users/xiaomingtai/Downloads/chinese_L-12_H-768_A-12
export MY_DATASET=/Users/xiaomingtai/Downloads/bert_model

python run_classifier.py \
  --data_dir=$MY_DATASET \
  --task_name=setest \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --output_dir=/Users/xiaomingtai/Downloads/ber_model_output/ \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=16 \
  --eval_batch_size=8 \
  --predict_batch_size=2 \
  --learning_rate=5e-5 \
  --num_train_epochs=3.0

BERT Training Results
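
With `--do_predict=true`, we expect run_classifier.py to write a file named test_results.tsv into the output directory, with one line per test example and one tab-separated probability per class. That file name and layout are assumptions based on the BERT reference script, so verify them against your own output. Under that assumption, a minimal sketch for turning the probabilities into labels could look like this; `read_predictions` is a hypothetical helper, and the default labels mirror `get_labels()` in MoveProcessor.

```python
import os

def read_predictions(output_dir, labels=("0", "1")):
    """Return a (predicted_label, probabilities) pair per line of test_results.tsv."""
    results = []
    path = os.path.join(output_dir, "test_results.tsv")
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            probs = [float(p) for p in line.split("\t")]
            if len(probs) != len(labels):
                raise ValueError("expected %d probabilities, got %d"
                                 % (len(labels), len(probs)))
            # Index of the highest probability; ties go to the first class
            best = max(range(len(probs)), key=probs.__getitem__)
            results.append((labels[best], probs))
    return results

# Example call, using the output path from the training command above:
# for label, probs in read_predictions("/Users/xiaomingtai/Downloads/ber_model_output/"):
#     print(label, probs)
```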
