Life Notes
This collected article introduces SimBERT text similarity, a semantic matching model for short texts. We found it quite good and are sharing it here as a reference.
SimBERT similar-text semantic recall, plus saving the model and serving it online: https://blog.csdn.net/weixin_42357472/article/details/116205077
SimBERT (a BERT model based on the UniLM idea that unifies retrieval and generation in one model; main applications: similar-text generation and similar-text retrieval): https://blog.csdn.net/u013250861/article/details/123649047
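The retrieval side of SimBERT boils down to three steps: encode each sentence into a vector, L2-normalize the vectors, and rank by dot product (which then equals cosine similarity). Before the full bert4keras code below, here is a minimal stdlib-only sketch of just that ranking step; the tiny 2-dimensional vectors are made up for illustration and stand in for real encoder outputs:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query_vec, corpus_vecs, k=2):
    """Rank corpus vectors by cosine similarity to the query, best first."""
    q = l2_normalize(query_vec)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(v))) for v in corpus_vecs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return [(i, sims[i]) for i in order[:k]]

corpus = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([1.0, 0.1], corpus))  # nearest corpus vector is index 0
```

The real model does exactly this, only with 384-dimensional SimBERT sentence vectors and numpy matrix operations instead of Python loops.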
import os

import numpy as np
import pandas as pd

os.environ['TF_KERAS'] = '1'  # must be set before importing bert4keras

from bert4keras.backend import keras, K
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.snippets import sequence_padding

maxlen = 32
config_path = r'D:***t\chinese_simbert_L-6_H-384_A-12\bert_config.json'
checkpoint_path = r'D:\*****rt\chinese_simbert_L-6_H-384_A-12\bert_model.ckpt'
dict_path = r'D:\****rt\chinese_simbert_L-6_H-384_A-12\vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)

# Build SimBERT in UniLM mode and use its pooled output as the sentence encoder
bert = build_transformer_model(
    config_path,
    checkpoint_path,
    with_pool='linear',
    application='unilm',
    return_keras_model=False,
)
encoder = keras.models.Model(bert.model.inputs, bert.model.outputs[0])

# Load the corpus of titles to index
datas1 = pd.read_csv(r'D:****raw_datas150.csv')
datas_all = list(datas1["title"])
data = datas_all

a_token_ids = []
texts = []
for d in data:
    token_ids = tokenizer.encode(d, maxlen=maxlen)[0]
    a_token_ids.append(token_ids)
    texts.append(d)

a_token_ids = sequence_padding(a_token_ids)
a_vecs = encoder.predict([a_token_ids, np.zeros_like(a_token_ids)], verbose=True)
# L2-normalize the corpus vectors so a dot product gives cosine similarity
a_vecs = a_vecs / (a_vecs**2).sum(axis=1, keepdims=True)**0.5
print(type(a_vecs))
np.save("sim_all_datas.npy", a_vecs)
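Because the corpus matrix is L2-normalized before it is saved, a separate serving process can reload `sim_all_datas.npy` and score a normalized query vector with a single matrix-vector product. A hedged numpy sketch of that reload-and-score step, using a small random matrix as a stand-in for the real encoder output:

```python
import numpy as np

# Stand-in for the saved sim_all_datas.npy: 5 corpus vectors of dimension 4
rng = np.random.default_rng(0)
a_vecs = rng.normal(size=(5, 4))
a_vecs = a_vecs / (a_vecs**2).sum(axis=1, keepdims=True)**0.5  # unit rows

# A query vector (would come from the SimBERT encoder), also unit-normalized
vec = rng.normal(size=4)
vec /= (vec**2).sum()**0.5

sims = a_vecs @ vec           # cosine similarities, shape (5,)
order = sims.argsort()[::-1]  # best match first
print(order[0], sims[order[0]])
```

In production the only difference is that `a_vecs` comes from `np.load("sim_all_datas.npy")` and `vec` from `encoder.predict`.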
def most_similar(text, topn=10):
    """Retrieve the topn sentences closest to `text`."""
    token_ids, segment_ids = tokenizer.encode(text, maxlen=maxlen)
    print(token_ids, segment_ids)
    vec = encoder.predict([[token_ids], [segment_ids]])[0]
    vec /= (vec**2).sum()**0.5  # L2-normalize the query vector
    sims = np.dot(a_vecs, vec)  # cosine similarity against the whole corpus
    return [(i, datas_all[i], sims[i]) for i in sims.argsort()[::-1][:topn]]

kk = ["海綿寶寶"]
mmm = []
for i in kk:
    results = most_similar(i, 10)
    mmm.append([i, results])
    print(i, results)
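`most_similar` fully sorts every similarity score, which is fine at 150 titles but wasteful on a large corpus. A standard numpy idiom (not something the original post uses) is `np.argpartition`, which selects the top-n candidates in linear time and then sorts only those n:

```python
import numpy as np

def top_n_indices(sims, topn=10):
    """Indices of the topn highest scores, best first, without a full sort."""
    n = min(topn, len(sims))
    cand = np.argpartition(-sims, n - 1)[:n]  # unordered top-n candidates
    return cand[np.argsort(-sims[cand])]      # order just those n

sims = np.array([0.1, 0.9, 0.3, 0.7])
print(top_n_indices(sims, 2))  # indices of the two best scores: [1 3]
```

Swapping `sims.argsort()[::-1][:topn]` for `top_n_indices(sims, topn)` keeps the return value identical while scaling better with corpus size.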
from paddlenlp import Taskflow

similarity = Taskflow("text_similarity")
# Score a sentence pair; returns a list of dicts with a 'similarity' field
print(similarity([["世界上什么东西最小", "世界上什么东西最细小"]]))
[2022-03-22 15:17:18,306] [INFO] - Downloading model_state.pdparams from https://bj.bcebos.com/paddlenlp/taskflow/text_similarity/simbert-base-chinese/model_state.pdparams
100%|██████████| 615M/615M [00:29<00:00, 22.1MB/s]
[2022-03-22 15:17:51,977] [INFO] - Downloading model_config.json from https://bj.bcebos.com/paddlenlp/taskflow/text_similarity/simbert-base-chinese/model_config.json
100%|██████████| 334/334 [00:00<00:00, 197kB/s]
[2022-03-22 15:17:52,154] [INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/simbert/vocab.txt and saved to /root/.paddlenlp/models/simbert-base-chinese
[2022-03-22 15:17:52,154] [INFO] - Downloading vocab.txt from https://bj.bcebos.com/paddlenlp/models/transformers/simbert/vocab.txt
100%|██████████| 63.4k/63.4k [00:00<00:00, 744kB/s]
[2022-03-22 15:18:10,818] [INFO] - Weights from pretrained model not used in BertModel: ['cls.predictions.decoder_bias', 'cls.predictions.transform.weight', 'cls.predictions.transform.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder_weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
[2022-03-22 15:18:12,113] [INFO] - Converting to the inference model cost a little time.
[2022-03-22 15:18:30,093] [INFO] - The inference model save in the path:/root/.paddlenlp/taskflow/text_similarity/simbert-base-chinese/static/inference
Summary
That is the full content of this collected Life Notes article on SimBERT text similarity and short-text semantic matching; we hope it helps you solve the problems you ran into.