[Deep Learning] Natural Language Processing --- BERT Development in Practice (Transformers)

Published: 2023/12/15


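This article explains how to use Hugging Face's transformers 2.0 to train NLP models.

Besides transformers, other BERT projects compatible with TF 2.0 include:

keras-bert (Star: 1.4k): supports TF2, but only the BERT pretrained model. Its usage is covered in another post on my blog: [Deep Learning] Natural Language Processing --- Using Keras BERT (Part 1)

bert4keras (Star: 692): supports TF2 and fine-tuning from the pretrained weights of bert/roberta/albert

bert-for-tf2 (Star: 329): only provides a TF 2.0 pipeline example

Hugging Face's transformers has also released transformers 2.0, which starts to support the pretrained models under TF 2.0. The TF support is not as complete as the PyTorch support, but it is sufficient for our scenario.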

transformers github
transformers Doc
transformers Online Demo
Paper: HuggingFace's Transformers: State-of-the-art Natural Language Processing

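Workflow

I. Loading the pretrained BERT model

1. Download Google's original pretrained BERT model

(1) First convert the original Google pretrained model files to PyTorch format

The conversion script is added to the environment when transformers is installed.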

python convert_bert_original_tf_checkpoint_to_pytorch.py -h

python convert_bert_original_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path Models/chinese_L-12_H-768_A-12/bert_model.ckpt.index \
    --bert_config_file Models/chinese_L-12_H-768_A-12/bert_config.json \
    --pytorch_dump_path Models/chinese_L-12_H-768_A-12/pytorch_model.bin

output:

INFO:transformers.modeling_bert:Converting TensorFlow checkpoint from /home/work/Bert/Models/chinese_L-12_H-768_A-12/bert_model.ckpt.index
Save PyTorch model to Models/chinese_L-12_H-768_A-12/pytorch_model.bin

The open-source repository contains many other conversion scripts (.py files) of this kind.

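2. Download the transformers pretrained models

The BERT and XLNet files used by transformers are stored on AWS, and the default download URLs in transformers point there, so downloads from within mainland China are very slow. The files therefore have to be downloaded manually.

1) Download the .txt, .json and .bin files locally

Taking BERT as an example, the download URLs for the .bin files (pretrained weights) are: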

BERT_PRETRAINED_MODEL_ARCHIVE_MAP = {
    'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin",
    'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-pytorch_model.bin",
    'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin",
    'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-pytorch_model.bin",
    'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-pytorch_model.bin",
    'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-pytorch_model.bin",
    'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin",
    'bert-base-german-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-cased-pytorch_model.bin",
    'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-pytorch_model.bin",
    'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-pytorch_model.bin",
    'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-pytorch_model.bin",
    'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-pytorch_model.bin",
    'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-pytorch_model.bin",
    'bert-base-german-dbmdz-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-pytorch_model.bin",
    'bert-base-german-dbmdz-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-pytorch_model.bin",
}

The download URLs for the .json files (model configs) are:

BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json",
    'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-config.json",
    'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json",
    'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-config.json",
    'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-config.json",
    'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-config.json",
    'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-config.json",
    'bert-base-german-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-cased-config.json",
    'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-config.json",
    'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-config.json",
    'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-config.json",
    'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-config.json",
    'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-config.json",
    'bert-base-german-dbmdz-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-config.json",
    'bert-base-german-dbmdz-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-config.json",
}

The download URLs for the .txt files (vocabulary files) are:

PRETRAINED_VOCAB_FILES_MAP = {
    'vocab_file': {
        'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt",
        'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt",
        'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt",
        'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt",
        'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased-vocab.txt",
        'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
        'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt",
        'bert-base-german-cased': "https://int-deepset-models-bert.s3.eu-central-1.amazonaws.com/pytorch/bert-base-german-cased-vocab.txt",
        'bert-large-uncased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-vocab.txt",
        'bert-large-cased-whole-word-masking': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-vocab.txt",
        'bert-large-uncased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-whole-word-masking-finetuned-squad-vocab.txt",
        'bert-large-cased-whole-word-masking-finetuned-squad': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-whole-word-masking-finetuned-squad-vocab.txt",
        'bert-base-cased-finetuned-mrpc': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-finetuned-mrpc-vocab.txt",
        'bert-base-german-dbmdz-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-cased-vocab.txt",
        'bert-base-german-dbmdz-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-german-dbmdz-uncased-vocab.txt",
    }
}
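The three maps above follow one naming pattern: a shared prefix, the model name, and a per-file suffix. As a minimal sketch, the hypothetical helper below (`bert_download_plan` is not part of transformers) rebuilds the three URLs for one model and pairs each with the short local file name used later in this article. It only constructs strings and does not download anything. It assumes the common pattern holds for the chosen model, which is not the case for the bert-base-german-cased vocabulary file listed above, and it does not check that the URLs are still reachable.

```python
# Prefix and suffixes copied from the archive maps above.
S3_PREFIX = "https://s3.amazonaws.com/models.huggingface.co/bert/"
SUFFIX_TO_LOCAL_NAME = {
    "-pytorch_model.bin": "pytorch_model.bin",
    "-config.json": "config.json",
    "-vocab.txt": "vocab.txt",
}


def bert_download_plan(model_name):
    """Return {local_file_name: remote_url} for one BERT checkpoint.

    Hypothetical helper for illustration; it only builds strings.
    """
    return {
        local_name: S3_PREFIX + model_name + suffix
        for suffix, local_name in SUFFIX_TO_LOCAL_NAME.items()
    }


plan = bert_download_plan("bert-base-chinese")
for local_name, url in plan.items():
    print(local_name, "<-", url)
```

Each URL can then be fetched with any download tool and saved under the local name shown on the left.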

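The Tsinghua (TUNA) mirror also supports automatic downloads from the Hugging Face hub. Note: the mirror option below is supported in transformers versions > 3.1.0.

Simply add the mirror option to the from_pretrained call, for example: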

AutoModel.from_pretrained('bert-base-uncased', mirror='tuna')

The two built-in sources are currently tuna and bfsu. A mirror URL can also be given explicitly, for example:

AutoModel.from_pretrained('bert-base-uncased', mirror='https://mirrors.tuna.tsinghua.edu.cn/hugging-face-models')

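Alternatively, search for the model directly on the website, click "List all files in model", download the files into a single folder, and load the model from that folder.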

3. Load the transformers pretrained models

(1) Load the converted model

import logging
logging.basicConfig(level=logging.INFO)

import tensorflow as tf
print("Tensorflow Version:", tf.__version__)

import torch
print("Pytorch Version:", torch.__version__)

import os
from transformers import BertConfig, BertModel, BertTokenizer, TFBertModel

pretrained_path = 'Models/chinese_L-12_H-768_A-12'
config_path = os.path.join(pretrained_path, 'bert_config.json')
vocab_path = os.path.join(pretrained_path, 'vocab.txt')

# Load the config
config = BertConfig.from_json_file(config_path)

# Load the converted weights as a PyTorch model
bert_model = BertModel.from_pretrained(pretrained_path, config=config)

# Load the same weights as a TF2 model (converted from the PyTorch checkpoint)
tfbert_model = TFBertModel.from_pretrained(pretrained_path, from_pt=True, config=config)

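Issue found: when the converted weights are loaded as a TF2 model, fewer parameters are loaded, so load the converted model with the PyTorch classes.

To load files that were downloaded manually, first create a folder named bert-base-uncased, put the three files into it, and rename each file by removing the bert-base-uncased- prefix.

Assuming the training script is named train.py, place the bert-base-uncased folder in the same directory as train.py.
If the files are not renamed or the folder is not in the right place, errors such as the following appear:
vocab.txt not found; pytorch_model.bin not found; Model name 'xxx/pytorch_model.bin ' was not found in model name list.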
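The renaming step can be scripted. The sketch below uses a hypothetical helper, `strip_model_prefix`, which removes the `<model_name>-` prefix from every file in a folder; the demo runs on empty placeholder files in a temporary directory rather than on real model files.

```python
import tempfile
from pathlib import Path


def strip_model_prefix(model_dir, model_name):
    """Rename '<model_name>-<file>' to '<file>' inside model_dir.

    Hypothetical helper for illustration; returns the new file names.
    """
    prefix = model_name + "-"
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(model_dir).iterdir()):
        if path.is_file() and path.name.startswith(prefix):
            target = path.with_name(path.name[len(prefix):])
            path.rename(target)
            renamed.append(target.name)
    return renamed


# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "bert-base-uncased"
    model_dir.mkdir()
    for suffix in ("config.json", "vocab.txt", "pytorch_model.bin"):
        (model_dir / f"bert-base-uncased-{suffix}").touch()
    renamed_files = strip_model_prefix(model_dir, "bert-base-uncased")
    print(renamed_files)
```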

Then test the setup with the following code:

import os
from transformers import BertTokenizer, BertModel

UNCASED = './bert-base-uncased'
VOCAB = 'vocab.txt'
tokenizer = BertTokenizer.from_pretrained(os.path.join(UNCASED, VOCAB))
bert = BertModel.from_pretrained(UNCASED)


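II. Encoding with the Tokenizer

The tokenizer converts plain text into encoded ids. This step does not turn words into word vectors; it only splits the text into tokens, handles the special markers [MASK], [SEP] and [CLS], and maps each token to its index in the vocabulary.

Encoding tokens as inputs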

tokenizer = BertTokenizer.from_pretrained(vocab_path)

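Step 1: the BERT tokenizer first splits the words into tokens.

Step 2: the special tokens needed for sentence classification are added ([CLS] at the first position and [SEP] at the end of the sentence).

Step 3: each token is replaced with its id from the embedding table, a component that comes with the trained model.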
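The three steps can be mimicked without any library. The sketch below is a toy illustration only: `TOY_VOCAB` and `toy_encode` are made up for this example, the splitting rule handles nothing but whitespace and a trailing "!", and the ids have no relation to a real BERT vocabulary (which uses WordPiece and is read from vocab.txt).

```python
# Toy vocabulary; a real one is read from vocab.txt and is far larger.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "hello": 4, "word": 5, "!": 6}


def toy_encode(text, vocab):
    """Mimic the three steps: split, add special tokens, map to ids."""
    # Step 1: split into tokens (whitespace plus a trailing "!" only).
    tokens = []
    for piece in text.lower().split():
        if piece.endswith("!") and len(piece) > 1:
            tokens.extend([piece[:-1], "!"])
        else:
            tokens.append(piece)
    # Step 2: wrap the sequence in [CLS] ... [SEP].
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: look up each token id, falling back to [UNK].
    ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return tokens, ids


toy_tokens, toy_ids = toy_encode("Hello word!", TOY_VOCAB)
print(toy_tokens)
print(toy_ids)
```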

Note that the tokenizer performs all of these steps in a single line of code. Its main methods are:

1. encode(text, ...): tokenizes the text and encodes it as a list of the corresponding ids

>>> tokenizer.encode('Hello word!')
[101, 8667, 1937, 106, 102]

2. encode_plus(text, ...): tokenizes the text and builds a dict containing the ids, the token types, and the attention mask

>>> tokenizer.encode_plus('Hello word!')
{'input_ids': [101, 8667, 1937, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}

3. convert_ids_to_tokens(ids, skip_special_tokens): maps ids to tokens

>>> tokenizer.convert_ids_to_tokens([101, 8667, 1937, 106, 102])
['[CLS]', 'Hello', 'word', '!', '[SEP]']

4. decode(token_ids): decodes ids back into text

>>> tokenizer.decode([101, 8667, 1937, 106, 102])
'[CLS] Hello word! [SEP]'

5. convert_tokens_to_ids(tokens): maps tokens to ids

>>> tokenizer.convert_tokens_to_ids(['[CLS]', 'Hello', 'word', '!', '[SEP]'])
[101, 8667, 1937, 106, 102]

tokenizer.encode("a visually stunning rumination on love", add_special_tokens=True)

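Encoding with the tokenizer:

encode returns only input_ids
encode_plus returns all of the encoding information
- input_ids: the ids of the tokens in the vocabulary
- token_type_ids: distinguishes the two sentences (all 0 for the first sentence, all 1 for the second)
- attention_mask: specifies which tokens take part in self-attention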

print(tokenizer.encode('我不喜歡你'))
# [101, 2769, 679, 1599, 3614, 872, 102]

sen_code = tokenizer.encode_plus('我不喜歡這世界', '我只喜歡你')
print(sen_code)
# {'input_ids': [101, 2769, 679, 1599, 3614, 6821, 686, 4518, 102, 2769, 1372, 1599, 3614, 872, 102],
#  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Converting input_ids back to tokens:

print(tokenizer.convert_ids_to_tokens(sen_code['input_ids']))
# ['[CLS]', '我', '不', '喜', '歡', '這', '世', '界', '[SEP]', '我', '只', '喜', '歡', '你', '[SEP]']

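Before the dataset can be processed as input, the shorter sentences must be padded with token id 0 so that all vectors have the same size. After padding we have a matrix/tensor that is ready to be passed to BERT.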
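A minimal sketch of this padding step in plain Python, assuming a hypothetical helper `pad_batch`. It pads every id list to the length of the longest one with id 0 and also builds the matching attention mask (1 for real tokens, 0 for padding). The first id list is the encode output shown earlier in this article; the second is a shorter list added only for illustration.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad id lists to a common length and build the attention masks.

    Hypothetical helper for illustration; returns (padded_ids, masks).
    """
    max_len = max(len(ids) for ids in batch_ids)
    padded, masks = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


batch = [
    [101, 2769, 679, 1599, 3614, 872, 102],  # ids from the encode example
    [101, 2769, 102],                        # shorter, illustrative only
]
padded_ids, attention_masks = pad_batch(batch)
for row, mask in zip(padded_ids, attention_masks):
    print(row, mask)
```

Both lists of lists are rectangular after padding, so they can be converted into tensors of shape (batch_size, sequence_length).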

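III. Using the BERT model

After downloading the three files config.json, vocab.txt and pytorch_model.bin for bert-base-chinese, put them in a folder named bert-base-chinese. In this example the folder is located under /home/work/transformers_file/.

Depending on the task: if you do not need to fine-tune for a specific task, you can simply use BertModel.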

import numpy as np
import torch
from transformers import BertTokenizer, BertConfig, BertForMaskedLM, BertForNextSentencePrediction
from transformers import BertModel

model_name = 'bert-base-chinese'
MODEL_PATH = '/home/work/transformers_file/bert-base-chinese/'

# a. Load the tokenizer from the vocabulary
tokenizer = BertTokenizer.from_pretrained(model_name)

# b. Load the config file
model_config = BertConfig.from_pretrained(model_name)

# Modify the config
model_config.output_hidden_states = True
model_config.output_attentions = True

# Load the model from the config and the local path
bert_model = BertModel.from_pretrained(MODEL_PATH, config=model_config)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = bert_model(**inputs)  # the model object is bert_model, not model
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Parameters:

input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states

The model needs at least one input: input_ids or inputs_embeds.


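1. input_ids: the ids of a sequence of tokens in the vocabulary. Shape: (batch_size, sequence_length). BERT inputs must be marked with [CLS] and [SEP]: [CLS] at the start and [SEP] at the end of each sentence. The input formats of the various BERT-style models are: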

bert:       [CLS] + tokens + [SEP] + padding
roberta:    [CLS] + prefix_space + tokens + [SEP] + padding
distilbert: [CLS] + tokens + [SEP] + padding
xlm:        [CLS] + tokens + [SEP] + padding
xlnet:      padding + tokens + [SEP] + [CLS]

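2. token_type_ids (optional): the sentence id of each token, 0 or 1 (0 means the token belongs to the first sentence, 1 means it belongs to the second). Shape: (batch_size, sequence_length). If None, BertModel defaults to all 0 (i.e. sentence A).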

tokens:         [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
token_type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1

tokens:         [CLS] the dog is hairy . [SEP]
token_type_ids: 0 0 0 0 0 0 0

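3. attention_mask (optional): each element is 0 or 1 and prevents attention from being computed on padding tokens (1 = not masked, 0 = masked). Shape: (batch_size, sequence_length). If None, BertModel defaults to all 1.

4. position_ids (optional): the position id of each token in the sentence. Shape: (batch_size, sequence_length). If None, BertModel generates them automatically, in the form [0, 1, 2, ..., seq_length - 1].

5. head_mask (optional): each element is 0 or 1; 1 means the head is active, 0 means it is disabled. Shape: (num_heads,) or (num_layers, num_heads).

6. inputs_embeds (optional): instead of input_ids, an already-embedded tensor can be passed directly. Shape: (batch_size, sequence_length, embedding_dim).

7. encoder_hidden_states (optional): the sequence of hidden states output by the last layer of the encoder; used when the model is configured as a decoder. Shape: (batch_size, sequence_length, hidden_size).

8. encoder_attention_mask (optional): prevents attention from being computed on padding tokens; used when the model is configured as a decoder. Shape: (batch_size, sequence_length).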
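To make the first four inputs concrete, here is a minimal sketch in plain Python that assembles them for a single sentence or a sentence pair, following the BERT format `[CLS] + tokens + [SEP] + padding` described above. The helper `build_bert_inputs` and its `max_length` argument are hypothetical and work on token strings rather than ids; a real tokenizer does this on ids and returns tensors. No truncation is performed, so `max_length` is assumed to be at least the unpadded length.

```python
def build_bert_inputs(tokens_a, tokens_b=None, max_length=None):
    """Assemble tokens, token_type_ids, attention_mask and position_ids.

    Hypothetical helper for illustration; operates on token strings.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)               # first sentence -> 0
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)  # second sentence -> 1
    attention_mask = [1] * len(tokens)               # real tokens -> 1
    if max_length is not None:
        # Assumes max_length >= len(tokens); a negative count adds nothing.
        n_pad = max_length - len(tokens)
        tokens += ["[PAD]"] * n_pad
        token_type_ids += [0] * n_pad
        attention_mask += [0] * n_pad                # padding -> 0
    position_ids = list(range(len(tokens)))          # 0 .. seq_length - 1
    return {
        "tokens": tokens,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
        "position_ids": position_ids,
    }


pair = build_bert_inputs(
    "is this jack ##son ##ville ?".split(),
    "no it is not .".split(),
    max_length=16,
)
single = build_bert_inputs("the dog is hairy .".split())
print(pair["token_type_ids"])
print(pair["attention_mask"])
print(single["token_type_ids"])
```

With `max_length=16` the pair receives two padding positions, which carry token type 0 and attention mask 0.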

Building the model

import tensorflow as tf
from transformers import TFBertPreTrainedModel, TFBertMainLayer


class BertNerModel(TFBertPreTrainedModel):
    def __init__(self, config, *inputs, **kwargs):
        super(BertNerModel, self).__init__(config, *inputs, **kwargs)
        self.bert_layer = TFBertMainLayer(config, name='bert')
        # Freeze the BERT weights
        self.bert_layer.trainable = False
        self.concat_layer = tf.keras.layers.Concatenate(name='concat_bert')

    def call(self, inputs):
        outputs = self.bert_layer(inputs)
        # Concatenate the outputs of the last n (here 4) hidden layers
        tensor = self.concat_layer(list(outputs[2][-4:]))
        return tensor

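This is only a brief sketch; depending on the task, you can add an RNN or other layers after bert_layer.

For how to write a custom model, see TFBertForSequenceClassification in the official source code, which inherits from TFBertPreTrainedModel.

self.bert_layer(inputs) returns a tuple containing, in order:

The output of the last hidden layer, shape=(batch_size, max_length, hidden_dimension)

The output corresponding to [CLS], shape=(batch_size, hidden_dimension)

The outputs of all hidden layers, present only when config.output_hidden_states = True. This is a list in which each element has shape (batch_size, max_length, hidden_dimension)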

Initializing the model

bert_ner_model = BertNerModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

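Because the model inherits from TFBertPreTrainedModel, it is initialized the same way as the parent class. The first argument is the pretrained model whose parameters are to be loaded.

Notes

Setting self.bert_layer.trainable = False lets the model converge faster and reduces training time.

At prediction time, the input data must have the same max_length as in training; otherwise the results are completely unusable. My guess is that this is because only part of the inputs was passed and no input_mask was given: according to the transformers source code, attention_mask defaults to all 1 when it is not provided.

Setting output_hidden_states=True makes the hidden-layer outputs available.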

