[Deep Learning] Natural Language Processing: Using BERT with Keras (Part 1)

Published: 2023/12/15 pytorch 34 豆豆

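1. BERT for Keras

keras_bert is a Keras version of BERT packaged by CyberZHG; it can load the officially released pretrained weights directly.

GitHub: https://github.com/CyberZHG/keras-bert

Quick install: pip install keras-bert

bert4keras is another packaged Keras version of BERT that can load the officially released pretrained weights directly (TF2 is supported).

GitHub: https://github.com/bojone/bert4keras

Quick install: pip install git+https://www.github.com/bojone/bert4keras.git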

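2. keras_bert

2.1. Tokenizer

In keras-bert, the Tokenizer splits text into characters (tokens) and generates the corresponding ids.

We need to supply a dictionary that stores the mapping from tokens to ids. The dictionary also contains BERT's special tokens:

[CLS], [SEP], [UNK], etc.

In the original example, whose dictionary assigns id 5 to [UNK], a character produced by tokenization that is not in the dictionary receives id 5, which stands for [UNK], i.e. unknown.

Using the same dictionary to tokenize a word that is not in the dictionary shows that, for English, the part of the word missing from the dictionary is split directly into individual letters.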
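
To make the two behaviours above concrete, here is a minimal pure-Python sketch of a greedy longest-match split with an [UNK] fallback. It does not call keras-bert; the toy dictionary and the helper names word_piece and to_ids are assumptions made for illustration, with [UNK] given id 5 to match the text.

```python
# Hypothetical toy dictionary; [UNK] has id 5 to match the description above.
TOKEN_DICT = {"[CLS]": 0, "[SEP]": 1, "un": 2, "##aff": 3, "##able": 4, "[UNK]": 5}


def word_piece(word, token_dict):
    """Greedy longest-match split; unmatched characters become single pieces."""
    if word in token_dict:
        return [word]
    pieces, start = [], 0
    while start < len(word):
        stop = len(word)
        piece = ""
        while stop > start:
            piece = word[start:stop]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in token_dict:
                break
            stop -= 1
        if stop == start:
            # Nothing matched: fall back to a single character.
            stop = start + 1
            piece = word[start:stop] if start == 0 else "##" + word[start:stop]
        pieces.append(piece)
        start = stop
    return pieces


def to_ids(pieces, token_dict):
    """Map pieces to ids; anything missing from the dictionary becomes [UNK]."""
    unk = token_dict["[UNK]"]
    return [token_dict.get(piece, unk) for piece in pieces]


known = word_piece("unaffable", TOKEN_DICT)
unknown = word_piece("unknown", TOKEN_DICT)
print(known, to_ids(known, TOKEN_DICT))
print(unknown, to_ids(unknown, TOKEN_DICT))
```

The word "unknown" shares only the prefix "un" with the dictionary, so the remaining letters come out one by one and each of them maps to id 5.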

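When two sentences are passed in, we can give the encode function a max_len parameter so that only max_len tokens of the tokenized text are kept.

If the tokenized text has fewer than max_len tokens, the remainder is padded with 0.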
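
The truncation and padding described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not keras-bert's own code: the function name encode_pair, the ids 101, 102 and 0 for [CLS], [SEP] and padding, and the example token ids are all chosen for the demonstration.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # illustrative ids


def encode_pair(first, second=None, max_len=None):
    """Pack one or two id lists into (indices, segments) of length max_len."""
    first, second = list(first), list(second or [])
    if max_len is not None:
        reserved = 3 if second else 2  # room for [CLS] and the [SEP] markers
        while (first or second) and len(first) + len(second) > max_len - reserved:
            # Trim from the end of the longer sequence first.
            if len(first) > len(second):
                first.pop()
            else:
                second.pop()
    indices = [CLS_ID] + first + [SEP_ID]
    segments = [0] * len(indices)
    if second:
        indices += second + [SEP_ID]
        segments += [1] * (len(second) + 1)
    if max_len is not None:
        padding = max_len - len(indices)
        indices += [PAD_ID] * padding  # pad with 0 up to max_len
        segments += [0] * padding
    return indices, segments


single = encode_pair([6427, 6241, 3563, 1798], max_len=10)
pair = encode_pair([6427, 6241, 3563, 1798], [872, 1962], max_len=10)
short = encode_pair([1, 2, 3, 4, 5], max_len=4)
print(single)
print(pair)
print(short)
```

The segment list marks the first sentence (including [CLS] and its [SEP]) with 0 and the second sentence with 1; padded positions are 0 in both lists.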

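2.2. Using a pretrained model

Reference: https://github.com/CyberZHG/keras-bert/tree/master/demo

We can use the load_trained_model_from_checkpoint() function to load a pretrained model that has already been downloaded locally.

Download links are available from the BERT GitHub repository.

Google BERT: https://github.com/google-research/bert

Chinese pretrained BERT-wwm: https://github.com/ymcui/Chinese-BERT-wwm

The following code uses the pretrained model to extract features from the input text.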

import os

import numpy as np
from keras_bert import Tokenizer, load_trained_model_from_checkpoint, load_vocabulary

# Paths to the pretrained model
pretrained_path = 'chinese_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')

# Build the dictionary: pass vocab_path to keras_bert's load_vocabulary()
token_dict = load_vocabulary(vocab_path)

# Equivalent manual construction:
# import codecs
# token_dict = {}
# with codecs.open(vocab_path, 'r', 'utf8') as reader:
#     for line in reader:
#         token = line.strip()
#         token_dict[token] = len(token_dict)

# Tokenization
tokenizer = Tokenizer(token_dict)

# Load the pretrained model
model = load_trained_model_from_checkpoint(config_path, checkpoint_path)

text = '语言模型'
tokens = tokenizer.tokenize(text)
# ['[CLS]', '语', '言', '模', '型', '[SEP]']
indices, segments = tokenizer.encode(first=text, max_len=512)
print(indices[:10])
print(segments[:10])

# Extract features: one vector per token position
predicts = model.predict([np.array([indices]), np.array([segments])])[0]
for i, token in enumerate(tokens):
    print(token, predicts[i].tolist()[:5])

import numpy as np

text1 = '语言模型'
text2 = '你好'
tokens1 = tokenizer.tokenize(text1)
print(tokens1)
tokens2 = tokenizer.tokenize(text2)
print(tokens2)

indices_new, segments_new = tokenizer.encode(first=text1, second=text2, max_len=512)
# Begins with [101, 6427, 6241, 3563, 1798, 102], followed by the ids of text2 and a closing [SEP]
print(indices_new[:10])
print(segments_new[:10])  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# Extract features
predicts_new = model.predict([np.array([indices_new]), np.array([segments_new])])[0]
# Tokenize the pair together so that each token lines up with its row in predicts_new
tokens_pair = tokenizer.tokenize(first=text1, second=text2)
for i, token in enumerate(tokens_pair):
    print(token, predicts_new[i].tolist()[:5])

# Load the language model (training=True builds the model with its MLM output)
model = load_trained_model_from_checkpoint(config_path, checkpoint_path, training=True)
token_dict_rev = {v: k for k, v in token_dict.items()}

token_ids, segment_ids = tokenizer.encode('数学是利用符号语言研究数量、结构、变化以及空间等概念的一门学科', max_len=512)
# Mask out "数学" (positions 1 and 2; position 0 is [CLS])
token_ids[1] = token_ids[2] = token_dict['[MASK]']
masks = np.array([[0, 1, 1] + [0] * (512 - 3)])

# Let the model predict the masked positions
predicts = model.predict([np.array([token_ids]), np.array([segment_ids]), masks])[0]
pred_indice = predicts[0][1:3].argmax(axis=1).tolist()
print('Fill with: ', list(map(lambda x: token_dict_rev[x], pred_indice)))
# Fill with:  ['数', '学']
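
The masking and decoding steps in the snippet above can be followed without a model in the small sketch below. Everything in it is made up for illustration: the five-token vocabulary, the ids, and the probability table that stands in for the model's MLM output.

```python
# Hypothetical five-token vocabulary, for illustration only.
token_dict = {"[PAD]": 0, "[MASK]": 1, "数": 2, "学": 3, "是": 4}
token_dict_rev = {v: k for k, v in token_dict.items()}

token_ids = [2, 3, 4]       # "数学是", special tokens omitted for brevity
masked_positions = [0, 1]   # hide "数" and "学"

# Replace the chosen positions with the [MASK] id.
masked_ids = list(token_ids)
for pos in masked_positions:
    masked_ids[pos] = token_dict["[MASK]"]

# 0/1 vector marking which positions the model is asked to predict.
masks = [1 if i in masked_positions else 0 for i in range(len(token_ids))]

# Made-up per-position probabilities over the vocabulary (one row per position).
probas = [
    [0.0, 0.0, 0.7, 0.2, 0.1],
    [0.0, 0.0, 0.1, 0.8, 0.1],
    [0.0, 0.0, 0.1, 0.1, 0.8],
]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


predicted = [token_dict_rev[argmax(probas[pos])] for pos in masked_positions]
print(masked_ids, masks, predicted)
```

Taking the arg-max over the vocabulary axis at each masked position and mapping the id back through the reversed dictionary is the same decoding step the real example performs on the model output.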

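3. bert4keras

3.1. Function overview

bert4keras is a Keras version of BERT rewritten on the basis of keras-bert.

It can also be adapted to ALBERT: just add model='albert' to the build_bert_model call.

The user experience is similar to keras_bert; the usage examples below are the ones provided on GitHub.

SimpleTokenizer is a simple tokenizer that splits text directly into a sequence of single characters. It is designed for Chinese and, in principle, is only suitable for Chinese models.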
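
A character-level tokenizer of the kind described above can be sketched in a few lines. This is not the bert4keras implementation; the function simple_tokenize and the tiny vocabulary are assumptions made for illustration.

```python
def simple_tokenize(text, vocab):
    """Split text into single characters, wrapped in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for ch in text:
        # Characters missing from the vocabulary become [UNK].
        tokens.append(ch if ch in vocab else "[UNK]")
    tokens.append("[SEP]")
    return tokens


# Hypothetical vocabulary for the demonstration.
vocab = {"[CLS]", "[SEP]", "[UNK]", "语", "言", "模", "型"}

chinese = simple_tokenize("语言模型", vocab)
mixed = simple_tokenize("BERT模型", vocab)
print(chinese)
print(mixed)
```

The second call hints at why such a tokenizer suits only Chinese models: an English word is broken into isolated letters, which a Chinese vocabulary may not cover at all.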

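The parameters accepted by build_bert_model are:

config_path: path to the JSON configuration file
checkpoint_file: path to the checkpoint file (the example code in section 3.2 passes it as checkpoint_path)
with_mlm: whether to include the MLM part, default False
with_pool: whether to include the POOL part, default False
with_nsp: whether to include the NSP part, default False
keep_words: list of token ids to keep
model: can be set to albert, default bert
applications: 'encoder': BertModel, 'seq2seq': Bert4Seq2seq, 'lm': Bert4LM; default encoder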
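
The list describes keep_words only as the token ids to keep. One plausible reading, stated here as an assumption rather than as the library's documented behaviour, is that the vocabulary (and with it the embedding table) is trimmed to those ids, so the remaining tokens need new consecutive ids. The toy dictionary and embedding rows below are made up to illustrate that idea.

```python
# Hypothetical vocabulary; id 5 is a rare character we decide to drop.
token_dict = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "语": 4, "罕": 5, "言": 6}
keep_words = [0, 1, 2, 3, 4, 6]  # ids to keep, in their new order

# Map each kept old id to its new position.
old_to_new = {old: new for new, old in enumerate(keep_words)}

# Trimmed dictionary with consecutive ids.
trimmed_dict = {tok: old_to_new[i] for tok, i in token_dict.items() if i in old_to_new}

# Stand-in embedding table: row i is [i, i]; keep only the selected rows.
embeddings = [[float(i), float(i)] for i in range(len(token_dict))]
kept_rows = [embeddings[i] for i in keep_words]

print(trimmed_dict)
print(len(kept_rows), kept_rows[5])
```

Under this reading, the tokenizer has to use the trimmed dictionary so that its ids index the reduced embedding table correctly.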

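3.2. Using a pretrained model

Reference: https://github.com/bojone/bert4keras/blob/master/examples

We can use the build_bert_model() function to load a pretrained model that has already been downloaded locally.

Download links are available from the BERT GitHub repository.

Google BERT: https://github.com/google-research/bert

Chinese pretrained BERT-wwm: https://github.com/ymcui/Chinese-BERT-wwm

The following code uses the pretrained model to extract features from the input text.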

import os

import numpy as np
from bert4keras.tokenizer import Tokenizer
from bert4keras.bert import build_bert_model

# Paths to the pretrained model
pretrained_path = 'chinese_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
checkpoint_path = os.path.join(pretrained_path, 'bert_model.ckpt')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')

# Build the tokenizer directly from the vocabulary file
tokenizer = Tokenizer(vocab_path)
# Token dictionary held by the tokenizer
token_dict = tokenizer._token_dict

# Load the pretrained model
model = build_bert_model(config_path=config_path, checkpoint_path=checkpoint_path)

text = '语言模型'
tokens = tokenizer.tokenize(text)
# ['[CLS]', '语', '言', '模', '型', '[SEP]']
indices, segments = tokenizer.encode(text, max_length=512)
print(indices[:10])
print(segments[:10])

# Extract features: one vector per token position
predicts = model.predict([np.array([indices]), np.array([segments])])[0]
for i, token in enumerate(tokens):
    print(token, predicts[i].tolist()[:5])

import numpy as np

text1 = '语言模型'
text2 = '你好'
tokens1 = tokenizer.tokenize(text1)
print(tokens1)
tokens2 = tokenizer.tokenize(text2)
print(tokens2)

indices_new, segments_new = tokenizer.encode(text1, text2, max_length=512)
print(indices_new[:10])
print(segments_new[:10])

# Extract features
predicts_new = model.predict([np.array([indices_new]), np.array([segments_new])])[0]
# In the encoded pair the second sentence follows the first without its own [CLS],
# so drop that token to line the tokens up with the rows of predicts_new
tokens_pair = tokens1 + tokens2[1:]
for i, token in enumerate(tokens_pair):
    print(token, predicts_new[i].tolist()[:5])

# Load the language model with the MLM head
model = build_bert_model(config_path=config_path, checkpoint_path=checkpoint_path, with_mlm=True)

token_ids, segment_ids = tokenizer.encode('科学技术是第一生产力')
# Mask out "技术" (positions 3 and 4; position 0 is [CLS])
token_ids[3] = token_ids[4] = tokenizer._token_dict['[MASK]']

# Use the MLM model to predict the masked positions
probas = model.predict([np.array([token_ids]), np.array([segment_ids])])[0]
print(tokenizer.decode(probas[3:5].argmax(axis=1)))  # the result is exactly "技术"

token_ids, segment_ids = tokenizer.encode('数学是利用符号语言研究数量、结构、变化以及空间等概念的一门学科')
# Mask out "数学" (positions 1 and 2)
token_ids[1] = token_ids[2] = tokenizer._token_dict['[MASK]']

# Use the MLM model to predict the masked positions
probas = model.predict([np.array([token_ids]), np.array([segment_ids])])[0]
print(tokenizer.decode(probas[1:3].argmax(axis=1)))  # the result is exactly "数学"
