[Deep Learning] Natural Language Processing --- Pretraining Chinese BERT Language Models with Huggingface and PyTorch

Published: 2023/12/15

Hugging Face is a chatbot startup headquartered in New York whose app is popular among teenagers. Compared with other companies, Hugging Face pays more attention to the emotional and environmental aspects of its product. Its website is https://huggingface.co/ .

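It is better known, however, for its focus on NLP technology and its large open-source community. In particular, Transformers, its open-source library of pretrained NLP models on GitHub, has been downloaded more than one million times and has over 24,000 GitHub stars. Transformers provides a large number of state-of-the-art pretrained language model architectures for NLP, together with a framework for calling them. The repo is at https://github.com/huggingface/transformers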

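The library was originally named pytorch-pretrained-bert and emerged alongside BERT. At the end of October 2018, Google open-sourced the TensorFlow implementation of BERT at https://github.com/google-research/bert . BERT's strong performance drew wide attention from NLP practitioners at the time. Almost simultaneously, pytorch-pretrained-bert received its first commit. It reproduced BERT's performance in PyTorch, which already had many supporters, and offered pretrained models for download, so that developers without much compute could do state-of-the-art fine-tuning within minutes. To date, transformers provides 32 kinds of pretrained language models covering more than 100 languages. Simple, powerful, and fast, it is an excellent choice for beginners.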

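One of the ACL 2020 Best Paper honorable mentions was "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks". The paper runs many language model pretraining experiments and systematically analyzes how much continued pretraining improves downstream tasks. Its main conclusions are: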

Continued pretraining on data from the target domain (DAPT) improves results; the less related the target-domain corpus is to RoBERTa's original pretraining corpus, the larger the gain from DAPT.
Continued pretraining on the dataset of the specific task (TAPT) improves results very "cheaply".
Combining the two (DAPT first, then TAPT) improves results further.
If more task-related unlabeled data can be obtained for continued pretraining (Curated-TAPT), results are best.
If no additional task-related unlabeled data can be obtained, a very lightweight, simple data selection strategy still improves results.
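
The last point, lightweight data selection, can be illustrated with a small sketch. The idea is to pull, from a large unlabeled domain corpus, the sentences that look most like the task data and continue pretraining on those. The version below swaps in character-bigram cosine similarity purely for illustration; it is not the embedding method used in the paper, and the function names and sample sentences are hypothetical.

```python
import math
from collections import Counter


def bigram_vector(text):
    """Bag of character bigrams; a crude stand-in for a learned sentence embedding."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_neighbours(task_texts, domain_texts, k=1):
    """For each task sentence, keep its k most similar unlabeled domain sentences."""
    domain_vectors = [bigram_vector(t) for t in domain_texts]
    selected = set()
    for text in task_texts:
        query = bigram_vector(text)
        ranked = sorted(range(len(domain_texts)),
                        key=lambda i: cosine(query, domain_vectors[i]),
                        reverse=True)
        selected.update(ranked[:k])
    # Preserve corpus order so the result is deterministic.
    return [domain_texts[i] for i in sorted(selected)]


# Hypothetical data: one task sentence and a tiny unlabeled domain corpus.
task = ["这款手机的电池续航很差"]
domain = ["今天股市大幅下跌", "手机电池续航时间太短", "这家餐厅的菜很好吃"]
extra_pretraining_data = select_neighbours(task, domain, k=1)
```

Only the second domain sentence shares bigrams with the task sentence, so it is the one kept for continued pretraining.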

Zhihu column 高能NLP: https://zhuanlan.zhihu.com/p/149210123

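Continued language model pretraining on top of BERT is already a reliable way to gain points in algorithm competitions, but the paper above is valuable because it analyzes this practice systematically. Most Chinese language models are trained with TensorFlow; a common example is the Chinese RoBERTa project, see: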

https://github.com/brightmart/roberta_zh

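Examples of pretraining a Chinese BERT language model with PyTorch are relatively scarce. Huggingface's Transformers includes some code that supports language model pretraining (it is not very rich; many features, such as wwm, are not supported).
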
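To finish BERT language model pretraining with as little code as possible, this article borrows some of that existing code and also shares some experience with language model pretraining in PyTorch. There are three common Chinese BERT language models: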

bert-base-chinese

roberta-wwm-ext

ernie

1 bert-base-chinese

(https://huggingface.co/bert-base-chinese)

This is the most common Chinese BERT language model, pretrained on a corpus based on Chinese Wikipedia. Taking it as the baseline, continuing language model pretraining on unlabeled in-domain data is simple: just use the official example.

https://github.com/huggingface/transformers/tree/master/examples/language-modeling

(This article uses transformers 3.0.2.)

Here $TRAIN_FILE is the path to the domain-specific Chinese corpus, and $TEST_FILE is the path to the file used for evaluation.

python run_language_modeling.py \
    --output_dir=output \
    --model_type=bert \
    --model_name_or_path=bert-base-chinese \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm
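
The command expects two plain-text files. A minimal sketch of producing them from a raw corpus is shown below, assuming one text per line and a simple random hold-out; the helper names, the 5% default ratio, and the file names in the usage note are hypothetical and not taken from the transformers example.

```python
import random
from pathlib import Path


def split_lines(lines, eval_ratio=0.05, seed=42):
    """Drop blank lines, shuffle deterministically, and split into (train, eval)."""
    texts = [line.strip() for line in lines if line.strip()]
    random.Random(seed).shuffle(texts)
    n_eval = max(1, int(len(texts) * eval_ratio))
    return texts[n_eval:], texts[:n_eval]


def write_corpus(texts, path):
    """Write one text per line, UTF-8 encoded."""
    Path(path).write_text("\n".join(texts) + "\n", encoding="utf-8")
```

For example, with a hypothetical corpus.txt, `train, held_out = split_lines(open("corpus.txt", encoding="utf-8"))` followed by `write_corpus(train, "train.txt")` and `write_corpus(held_out, "eval.txt")` gives the two paths to export as TRAIN_FILE and TEST_FILE. How the script chunks each file into training examples depends on its data arguments, so it is worth checking them for the version in use.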

2 roberta-wwm-ext

(https://github.com/ymcui/Chinese-BERT-wwm)

A pretrained language model released by the Joint Laboratory of HIT and iFLYTEK Research (HFL). It is pretrained in a RoBERTa-like way, for example with dynamic masking and more training data. On many tasks it outperforms bert-base-chinese.

For Chinese RoBERTa-style PyTorch models, load them as follows:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
roberta = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

Be sure not to use the officially recommended form:

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
model = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

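This is because the files of the Chinese RoBERTa-style models, such as vocab.txt, follow the BERT design, whereas English RoBERTa models read their vocabulary from vocab.json by default. Some English RoBERTa models can indeed be loaded automatically through AutoModel. This explains why the Chinese RoBERTa sample code in huggingface's model hub does not run with the transformers version used here.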
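
The file-format difference can be checked directly on a downloaded checkpoint. The sketch below guesses the tokenizer family from the vocabulary files that are present, assuming the usual layouts (vocab.txt for BERT-style WordPiece, vocab.json plus merges.txt for RoBERTa-style BPE); the helper names are hypothetical and this is only a heuristic, since other architectures also ship a vocab.txt.

```python
from pathlib import Path


def tokenizer_family(filenames):
    """Guess which tokenizer style a checkpoint's vocabulary files were written for."""
    names = set(filenames)
    if "vocab.txt" in names:
        return "bert"      # WordPiece vocabulary, the format BertTokenizer reads
    if {"vocab.json", "merges.txt"} <= names:
        return "roberta"   # BPE vocabulary and merges, the format RobertaTokenizer reads
    return "unknown"


def tokenizer_family_of_dir(model_dir):
    """Apply the same check to a local checkpoint directory."""
    return tokenizer_family(p.name for p in Path(model_dir).iterdir())
```

A directory holding the Chinese roberta-wwm-ext files would come back as "bert" under this check, which matches the advice to load it with BertTokenizer.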

To continue pretraining roberta with the run_language_modeling.py script above, two more changes are needed.

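Download roberta-wwm-ext to the local directory hflroberta, and in config.json change "model_type":"roberta" to "model_type":"bert".
In run_language_modeling.py above, replace the AutoModel and AutoTokenizer classes with BertModel and BertTokenizer. Because the script trains with the masked-LM objective, the model replacement presumably needs to be the Bert class with an LM head (BertForMaskedLM) rather than the bare encoder.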
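
The first change, editing config.json, can be scripted. The following is a minimal sketch using only the standard library; the directory name hflroberta comes from the text above, while the helper names are hypothetical.

```python
import json
from pathlib import Path


def with_model_type(config, model_type="bert"):
    """Return a copy of a config dict whose "model_type" is set to the given value."""
    patched = dict(config)
    patched["model_type"] = model_type
    return patched


def patch_config_file(model_dir, model_type="bert"):
    """Rewrite <model_dir>/config.json in place with the new model_type."""
    path = Path(model_dir) / "config.json"
    config = json.loads(path.read_text(encoding="utf-8"))
    patched = with_model_type(config, model_type)
    path.write_text(json.dumps(patched, indent=2, ensure_ascii=False), encoding="utf-8")
```

Calling `patch_config_file("hflroberta")` performs the edit described above. The same helper also adds the field when config.json does not contain it yet.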

python run_language_modeling_roberta.py \
    --output_dir=output \
    --model_type=bert \
    --model_name_or_path=hflroberta \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm

3 ernie

(https://github.com/nghuyong/ERNIE-Pytorch)

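ernie is a pretrained model released by Baidu, trained on Chinese corpora such as Baidu Zhidao and Baidu Tieba together with tasks such as entity prediction. On some tasks its accuracy exceeds that of bert-base-chinese and roberta. Continuing pretraining on domain data from the ernie1.0 model requires only one change.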

Download ernie1.0 to the local directory ernie, and add the field "model_type":"bert" to config.json.

python run_language_modeling.py \
    --output_dir=output \
    --model_type=bert \
    --model_name_or_path=ernie \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm

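Finally, the masking used for language model pretraining in the huggingface project is shown below. It still randomly masks 15% of the tokens and predicts the originals. For more advanced schemes such as whole word masking or entity prediction, you can modify transformers.DataCollatorForLanguageModeling yourself.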

def mask_tokens(self, inputs: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
    """
    if self.tokenizer.mask_token is None:
        raise ValueError(
            "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
        )

    labels = inputs.clone()
    # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
    probability_matrix = torch.full(labels.shape, self.mlm_probability)
    special_tokens_mask = [
        self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    ]
    probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
    if self.tokenizer._pad_token is not None:
        padding_mask = labels.eq(self.tokenizer.pad_token_id)
        probability_matrix.masked_fill_(padding_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # We only compute loss on masked tokens

    # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

    # 10% of the time, we replace masked input tokens with random word
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
    inputs[indices_random] = random_words[indices_random]

    # The rest of the time (10% of the time) we keep the masked input tokens unchanged
    return inputs, labels
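
For comparison, here is a rough sketch of what whole word masking changes. It is written in plain Python on character lists rather than tensors, and it is not the transformers implementation. It assumes the text has already been segmented into words by some external Chinese word segmenter and that each character is one token; the function name, the per-word sampling scheme, and the use of None as the "ignore" label are all illustrative choices.

```python
import random


def whole_word_mask(words, vocab, mlm_probability=0.15, mask_token="[MASK]", seed=None):
    """Select whole words for masking and return (inputs, labels) as character lists.

    labels holds the original character at selected positions and None elsewhere,
    playing the role of the -100 "ignore" label in the tensor version.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for word in words:
        selected = rng.random() < mlm_probability  # one decision per word, not per character
        for char in word:
            if not selected:
                inputs.append(char)
                labels.append(None)  # no loss on unselected positions
                continue
            labels.append(char)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(mask_token)          # 80%: mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))   # 10%: random vocabulary entry
            else:
                inputs.append(char)                # 10%: keep the original character
    return inputs, labels
```

The only difference from token-level masking is where the 15% decision is made: once per word, so all characters of a chosen word become prediction targets together. The 80/10/10 replacement rule is still applied per character.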

The code repository for the experiments in this article. Ready to use!

https://github.com/zhusleep/pytorch_chinese_lm_pretrain
