Build a BERT-based NER Model in Five Minutes
A Brief Introduction to BERT
BERT is a pre-trained language model released by Google in 2018. It broke records on many NLP tasks, and its release was a milestone for the field. A pre-trained language model picks up a great deal of syntactic and semantic knowledge through unsupervised learning, which makes downstream NLP tasks considerably easier. When BERT is applied to a supervised downstream task, it is like a student who has done thorough preparation before class: the results come with half the effort. Earlier word-embedding techniques such as word2vec and GloVe also let a model acquire some basic linguistic knowledge through unsupervised training, but they cannot match BERT either in the complexity of the pre-trained model (roughly, its learning capacity) or in the difficulty of its unsupervised pre-training tasks.
The Model
BERT uses a stack of 12 or 24 layers of bidirectional Transformer encoders as its feature extractor, as shown in the figure below. In NLP, the rough ranking of feature-extraction ability is Transformer > RNN > CNN; readers unfamiliar with the Transformer can refer to my earlier article on it. Using 12 encoder layers at once pushed NLP a real step further in the direction of truly deep models.
[Figure: BERT stacks 12 or 24 bidirectional Transformer encoder layers]
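For reference, the two configurations released by Google (as reported in the original BERT paper) can be summarized in a tiny Python snippet; the variable names here are just for illustration:

BERT_BASE  = {"encoder_layers": 12, "hidden_size": 768,  "attention_heads": 12, "params": "~110M"}
BERT_LARGE = {"encoder_layers": 24, "hidden_size": 1024, "attention_heads": 16, "params": "~340M"}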
Pre-training Tasks
To help the model acquire a solid grasp of natural language, BERT is pre-trained on the following two tasks:
1. Masked word prediction (the masked language modeling task), illustrated in the figure below:
15% of the tokens in the input text are masked at random, and the model must predict what the masked tokens were. This is essentially a cloze test for the model, except that the "options" are the entire vocabulary. Considering how often we get four-option cloze questions wrong on exams, a task like this forces the model to learn a great deal of syntax and even semantics.
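To make the masking procedure concrete, here is a minimal sketch in plain Python. The helper name and toy vocabulary are invented for illustration; BERT itself applies this to WordPiece tokens inside its data pipeline, using the 80/10/10 replacement rule from the paper:

import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15):
    # Select roughly 15% of the tokens as prediction targets, then apply
    # BERT's 80/10/10 rule: 80% become [MASK], 10% become a random token,
    # 10% are left unchanged (but still have to be predicted).
    toy_vocab = ["我", "愛", "荊", "州", "你", "好"]  # toy vocabulary, illustration only
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                      # the model must recover this token
            r = random.random()
            if r < 0.8:
                masked.append(mask_token)
            elif r < 0.9:
                masked.append(random.choice(toy_vocab))
            else:
                masked.append(tok)
        else:
            masked.append(tok)
            targets.append(None)                     # not a prediction target
    return masked, targets

print(mask_tokens(list("我愛荊州你好")))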
2. Next sentence prediction
The next sentence prediction task is shown in the figure below: the model is given two sentences, A and B, and must decide whether B is the sentence that actually follows A. This task is meant to teach the model relationships between sentences and to deepen its understanding of natural language one step further.
[Figure: next sentence prediction]
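A rough sketch of how such sentence pairs could be constructed for pre-training; the function name and sample corpus below are made up for illustration, and a real pipeline works over a huge corpus:

import random

def make_nsp_pairs(sentences):
    # Pair each sentence with its true successor half of the time ("IsNext")
    # and with a randomly sampled sentence otherwise ("NotNext").
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            # a real pipeline would also avoid accidentally sampling
            # the true next sentence as a negative example
            pairs.append((sentences[i], random.choice(sentences), "NotNext"))
    return pairs

corpus = ["我愛荊州。", "荊州是一座歷史名城。", "今天天氣不錯。"]
print(make_nsp_pairs(corpus))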
These two rather difficult pre-training tasks give the model a fairly deep understanding of natural language before it ever sees a downstream task, and that knowledge is enormously helpful for downstream NLP. Of course, pre-training requires a huge corpus, a lot of compute, and a lot of time. But it only has to be done once and can then be reused indefinitely, so this once-and-for-all effort is well worth making.
Hands-on NER with BERT
First, a quick introduction to the kashgari framework (its GitHub repository is linked here). Its author built it so that people can easily call on some of the more advanced NLP techniques and run experiments quickly. kashgari wraps a BERT embedding module, an LSTM-CRF entity-recognition model, and several classic text-classification networks. Using this framework, I got a BERT-based NER model running on my own dataset in five minutes.
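If you want to follow along, the package can be installed from PyPI. Note that the kashgari.tasks.seq_labeling API used below belongs to the early 0.x releases of the framework (later versions reorganized the modules), so you may need to pin an old version:

pip install kashgari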
Reading the Data
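The loading code below assumes that the train_data file is in a CoNLL-style, character-level format: each line holds one character and its label separated by whitespace, and sentences are separated by a blank line, for example:

我 O
愛 O
荊 B_LOC
州 I_LOC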
with open("train_data","rb") as f: data = f.read().decode("utf-8") train_data = data.split("\n\n") train_data = [token.split("\n") for token in train_data] train_data = [[j.split() for j in i ] for i in train_data] train_data.pop() 數(shù)據(jù)預(yù)處理
train_x = [[token[0] for token in sen] for sen in train_data]
train_y = [[token[1] for token in sen] for sen in train_data]
Here train_x and train_y are both lists:
train_x: [char_seq1, char_seq2, char_seq3, ...]
train_y: [label_seq1, label_seq2, label_seq3, ...]
where, for example, char_seq1 = ["我", "愛", "荊", "州"]
and the corresponding label_seq1 = ["O", "O", "B_LOC", "I_LOC"]
All you need to do is preprocess the data so that each character is paired with one label — very convenient. kashgari already includes the modules that convert text and labels to indices and vectors, so you no longer have to worry about numericalizing the text or the labels yourself. One point to stress: because Google's open-source Chinese BERT pre-trained model uses character-level input, the preprocessing step must split the text into individual characters rather than words.
Loading BERT
The BERT model can be loaded easily with just the three lines below.
from kashgari.embeddings import BERTEmbedding
from kashgari.tasks.seq_labeling import BLSTMCRFModel
embedding = BERTEmbedding("bert-base-chinese", 200)   # 200 is the sequence length
When this runs, the code automatically downloads the pre-trained weights from where the BERT model is hosted. Google has already pre-trained BERT for us, so all we have to do is use it for the downstream task.
[Figure: downloading the pre-trained Chinese BERT weights]
Building and Training the Model
Using the LSTM+CRF model wrapped by kashgari, we simply feed in the data, set the batch size, and training starts. The whole process really does take less than five minutes (unless your connection is slow, in which case downloading the pre-trained BERT weights alone may take longer than that).
model = BLSTMCRFModel(embedding)
model.fit(train_x, train_y, epochs=1, batch_size=100)
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
Input-Token (InputLayer)        (None, 200)          0
Input-Segment (InputLayer)      (None, 200)          0
Embedding-Token (TokenEmbedding [(None, 200, 768), ( 16226304    Input-Token[0][0]
Embedding-Segment (Embedding)   (None, 200, 768)     1536        Input-Segment[0][0]
Embedding-Token-Segment (Add)   (None, 200, 768)     0           Embedding-Token[0][0]
                                                                 Embedding-Segment[0][0]
Embedding-Position (PositionEmb (None, 200, 768)     153600      Embedding-Token-Segment[0][0]
Embedding-Dropout (Dropout)     (None, 200, 768)     0           Embedding-Position[0][0]
Embedding-Norm (LayerNormalizat (None, 200, 768)     1536        Embedding-Dropout[0][0]
Encoder-1-MultiHeadSelfAttentio (None, 200, 768)     2362368     Embedding-Norm[0][0]
Encoder-1-MultiHeadSelfAttentio (None, 200, 768)     0           Encoder-1-MultiHeadSelfAttention[
Encoder-1-MultiHeadSelfAttentio (None, 200, 768)     0           Embedding-Norm[0][0]
                                                                 Encoder-1-MultiHeadSelfAttention-
Encoder-1-MultiHeadSelfAttentio (None, 200, 768)     1536        Encoder-1-MultiHeadSelfAttention-
Encoder-1-FeedForward (FeedForw (None, 200, 768)     4722432     Encoder-1-MultiHeadSelfAttention-
Encoder-1-FeedForward-Dropout ( (None, 200, 768)     0           Encoder-1-FeedForward[0][0]
Encoder-1-FeedForward-Add (Add) (None, 200, 768)     0           Encoder-1-MultiHeadSelfAttention-
                                                                 Encoder-1-FeedForward-Dropout[0][
Encoder-1-FeedForward-Norm (Lay (None, 200, 768)     1536        Encoder-1-FeedForward-Add[0][0]
... (the same MultiHeadSelfAttention + FeedForward block repeats for Encoder-2, Encoder-3, and the remaining encoder layers; the printed summary is truncated in the original)
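After training, the same object can be used for inference. The snippet below is a minimal sketch assuming the predict method of kashgari's early seq_labeling models, which takes a list of character sequences in the same format as the training data; the example sentence and the shown output are illustrative, not actual results:

# Input format matches train_x: a list of character sequences.
new_x = [list("我愛荊州")]
print(model.predict(new_x))
# illustrative output: [['O', 'O', 'B_LOC', 'I_LOC']]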
Summary
With Google's pre-trained Chinese BERT as the embedding layer and kashgari's BLSTMCRFModel on top, building a character-level NER model comes down to reading the data into character/label sequences, loading the embedding, and calling fit — a handful of lines of code and, network speed permitting, roughly five minutes.