[Python AI] 34. BERT (3): Building a BERT Model with keras-bert for Weibo Sentiment Analysis
Starting with this column, the author has been formally studying Python deep learning, neural networks, and related AI topics. The previous articles opened a new topic, BERT, introducing installation and basic usage of the keras-bert library and a text classification task. This article builds a BERT model with the keras-bert library and applies it to Weibo sentiment analysis. It is an introductory article, and I hope it helps you!
The code in this article is based on the blog of "山陰少年", combined with my own experience; I reproduce and explain the code in detail. I hope it is helpful, especially for beginners, and I strongly recommend following this author's articles.
- NLP (35): Multi-class text classification with keras-bert (by 山陰少年)
- https://github.com/percent4/keras_bert_text_classification
The Weibo sentiment prediction results look like this:
```
原文: 《長(zhǎng)津湖》這部電影真的非常好看,今天看完好開心,愛了愛了。強(qiáng)烈推薦大家,哈哈!!!
預(yù)測(cè)標(biāo)簽: 喜悅

原文: 聽到這個(gè)消息真心難受,很傷心,怎么這么悲劇。保佑保佑,哭
預(yù)測(cè)標(biāo)簽: 哀傷

原文: 憤怒,我真的挺生氣的,怒其不爭(zhēng),哀其不幸啊!
預(yù)測(cè)標(biāo)簽: 憤怒
```

Table of contents
- I. Introducing the BERT Model
- II. Introduction to the Dataset
- III. Weibo Sentiment Analysis with Machine Learning
- IV. Weibo Sentiment Analysis with the BERT Model
- 1. Model Training
- 2. Model Evaluation
- 3. Model Prediction
- V. Summary
This column combines the author's previous blog posts, AI experience, and related videos and papers; as the series deepens, more Python AI cases and applications will follow. These are introductory articles, and I hope they help; if there are errors or shortcomings, please bear with me. As a newcomer to AI, I hope we can grow together through these posts, stroke by stroke. After many years of blogging this is my first paid column, meant to earn a little milk-powder money for my baby, but most posts, especially the introductory ones, will remain free. I will put real effort into this column so it is worthy of its readers. Feel free to message me any time; I only hope you can learn something from this series. Let's keep at it!
- Keras code download: https://github.com/eastmountyxz/AI-for-Keras
- TensorFlow code download: https://github.com/eastmountyxz/AI-for-TensorFlow
Previous articles in this series:
- [Python AI] 1. Setting up the TensorFlow 2.0 environment and an introduction to neural networks
- [Python AI] 2. TensorFlow basics and a simple line-fitting prediction example
- [Python AI] 3. TensorFlow basics: Session, variables, feed values, and activation functions
- [Python AI] 4. Building a regression neural network in TensorFlow and the Optimizer
- [Python AI] 5. TensorBoard visualization basics and drawing the whole neural network
- [Python AI] 6. Classification with TensorFlow and the MNIST handwritten-digit recognition example
- [Python AI] 7. What is overfitting, and solving overfitting in neural networks with dropout
- [Python AI] 8. Convolutional neural networks (CNN) explained and writing a CNN in TensorFlow
- [Python AI] 9. Installing gensim Word2Vec and Chinese short-text similarity on《慶余年》
- [Python AI] 10. TensorFlow + OpenCV custom CNN image classification, compared with KNN image classification
- [Python AI] 11. How TensorFlow saves neural network parameters
- [Python AI] 12. Recurrent neural networks (RNN, LSTM) explained and an RNN classification example in TensorFlow
- [Python AI] 13. How to evaluate neural networks, drawing loss curves, and computing the F-score for image classification
- [Python AI] 14. LSTM RNN regression example: predicting a sine curve
- [Python AI] 15. Unsupervised learning: Autoencoder principles and clustering visualization
- [Python AI] 16. Setting up Keras, getting started, and a regression neural network example
- [Python AI] 17. Building a classification neural network in Keras and the MNIST digit-image example
- [Python AI] 18. Building convolutional neural networks in Keras and CNN principles explained
- [Python AI] 19. Building an RNN classification example in Keras and RNN principles explained
- [Python AI] 20. Text classification with Keras + RNN vs. traditional machine learning
- [Python AI] 21. Word2Vec + CNN Chinese text classification, compared with machine learning classifiers (RF, DTC, SVM, KNN, NB, LR)
- [Python AI] 22. Sentiment analysis and emotion computation based on the Dalian University of Technology sentiment lexicon
- [Python AI] 23. Sentiment classification with machine learning and TF-IDF (with detailed NLP data cleaning)
- [Python AI] 24. Setting up a Keras environment on a 易學(xué)智能 GPU to classify malicious URL requests with LSTM
- [Python AI] 26. Medical named entity recognition with BiLSTM-CRF (part 1): data preprocessing
- [Python AI] 27. Medical named entity recognition with BiLSTM-CRF (part 2): model construction
- [Python AI] 28. A comprehensive summary of Chinese text classification with Keras (CNN, TextCNN, LSTM, BiLSTM, BiLSTM+Attention)
- [Python AI] 29. What are generative adversarial networks (GAN)? Basics and code (1)
- [Python AI] 30. Building a CNN in Keras to recognize handwritten Arabic script images
- [Python AI] 31. Weibo sentiment classification with BiLSTM in Keras and LDA topic mining
- [Python AI] 32. BERT (1): Keras-bert basic usage and pre-trained models
- [Python AI] 33. BERT (2): Building a BERT model with keras-bert for text classification
- [Python AI] 34. BERT (3): Building a BERT model with keras-bert for Weibo sentiment analysis
I. Introducing the BERT Model
The principles of the BERT model will be covered in a later article, mainly around Google's paper and the model's advantages. Here it only serves as a brief introduction:
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- https://arxiv.org/pdf/1810.04805.pdf
- https://github.com/google-research/bert
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model proposed by the Google AI team in 2018. It achieved astonishing results on the top-level machine reading comprehension benchmark SQuAD 1.1 and set new state-of-the-art results on 11 different NLP tasks, including pushing the GLUE benchmark to 80.4% (a 7.6% absolute improvement) and MultiNLI accuracy to 86.7% (a 5.6% absolute improvement). It was foreseeable that BERT would bring milestone changes to NLP, and it remains one of the most important recent advances in the field.
BERT no longer pre-trains with a traditional unidirectional language model, or with a shallow concatenation of two unidirectional language models, as earlier work did; instead it uses a new masked language model (MLM) objective, so it can produce deep bidirectional language representations. Its framework is shown below; a later article will go into detail, and here it is only an introduction. Readers are encouraged to read the original paper. A toy sketch of the MLM masking idea follows.
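To make the MLM idea concrete, here is a toy sketch of my own (an illustration only, not BERT's exact masking strategy, which also sometimes replaces a selected token with a random token or keeps it unchanged): roughly 15% of tokens are masked and the originals are kept as prediction targets.

```python
import random

# Toy illustration of the masked language model (MLM) idea:
# mask ~15% of tokens and keep the originals as prediction targets.
def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)   # the model must recover this position
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)        # ignored by the MLM loss
    return masked, targets

print(mask_tokens(["這", "部", "電", "影", "真", "好", "看"]))
```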
II. Introduction to the Dataset
Dataset description:
| Item | Description |
| --- | --- |
| Overview | Over 360,000 sentiment-labeled Sina Weibo posts covering 4 emotions: roughly 200,000 joy posts and about 50,000 each for anger, disgust, and low mood |
| Suggested experiments | Sentiment / opinion / review polarity analysis |
| Source | Sina Weibo |
| Original dataset | A Weibo sentiment analysis dataset collected from the web; the original author and source are unknown |
| Details | 361,744 posts in total: joy (喜悅) 199,496, anger (憤怒) 51,714, disgust (厭惡) 55,267, low (低落) 55,267 |
| Label mapping | 0: joy (喜悅), 1: anger (憤怒), 2: disgust (厭惡), 3: low (低落) |
Data sample:
Note: it was only while running the experiments that the author discovered the "disgust" (55,267) and "low" (55,267) subsets are completely identical, so we treat this as a three-class problem (喜悅 / 憤怒 / 哀傷); what matters most is the approach itself. A quick pandas check like the sketch below confirms the class distribution.
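A minimal sketch of that check, assuming the CSV splits and the "label" / "content" columns used by the code in Sections III and IV:

```python
import pandas as pd

# Load the training split and inspect the three-class label distribution.
# File path and column names follow the later training code; adjust if yours differ.
df = pd.read_csv("data/weibo_3_moods_train.csv").fillna(value="")
print(df.shape)                      # (number of Weibo posts, 2)
print(df["label"].value_counts())    # counts for 喜悅 / 憤怒 / 哀傷
```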
Download link:
- https://github.com/eastmountyxz/Datasets-Text-Mining
Reference link:
- https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/simplifyweibo_4_moods/intro.ipynb
III. Weibo Sentiment Analysis with Machine Learning
First, we walk through the machine-learning code for Weibo sentiment analysis. The pipeline is:
- Read the data
- Data preprocessing (Chinese word segmentation)
- TF-IDF computation
- Building the classification model
- Prediction and evaluation
The complete code is as follows:
```python
# -*- coding: utf-8 -*-
"""
Created on Mon Sep 27 22:21:53 2021
@author: xiuzhang
"""
import jieba
import pandas as pd
import numpy as np
from collections import Counter
from scipy.sparse import coo_matrix
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import neighbors
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier

#-----------------------------------------------------------------------------
# Read the data
train_path = 'data/weibo_3_moods_train.csv'
test_path = 'data/weibo_3_moods_test.csv'
types = {0: '喜悅', 1: '憤怒', 2: '哀傷'}
pd_train = pd.read_csv(train_path)
pd_test = pd.read_csv(test_path)
print('訓(xùn)練集數(shù)目(總體):%d' % pd_train.shape[0])
print('測(cè)試集數(shù)目(總體):%d' % pd_test.shape[0])

# Chinese word segmentation
train_words = []
test_words = []
train_labels = []
test_labels = []
stopwords = ["[", "]", ")", "(", ")", "(", "【", "】", "!", ",", "$",
             "·", "?", ".", "、", "-", "—", ":", ":", "《", "》", "=",
             "。", "…", "“", "?", "”", "~", " ", "-", "+", "\\", "‘",
             "~", ";", "’", "...", "..", "&", "#", "....", ",",
             "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
             "的", "和", "之", "了", "哦", "那", "一個(gè)", ]

for line in range(len(pd_train)):
    dict_label = pd_train['label'][line]
    dict_content = str(pd_train['content'][line])   # float => str
    #print(dict_label, dict_content)
    cut_words = ""
    data = dict_content.strip("\n")
    data = data.replace(",", "")    # be sure to strip "," or extra CSV columns appear
    seg_list = jieba.cut(data, cut_all=False)
    for seg in seg_list:
        if seg not in stopwords:
            cut_words += seg + " "
    #print(cut_words)
    label = -1
    if dict_label == "喜悅":
        label = 0
    elif dict_label == "憤怒":
        label = 1
    elif dict_label == "哀傷":
        label = 2
    else:
        label = -1
    train_labels.append(label)
    train_words.append(cut_words)
print(len(train_labels), len(train_words))   #209043 209043

for line in range(len(pd_test)):
    dict_label = pd_test['label'][line]
    dict_content = str(pd_test['content'][line])
    cut_words = ""
    data = dict_content.strip("\n")
    data = data.replace(",", "")
    seg_list = jieba.cut(data, cut_all=False)
    for seg in seg_list:
        if seg not in stopwords:
            cut_words += seg + " "
    label = -1
    if dict_label == "喜悅":
        label = 0
    elif dict_label == "憤怒":
        label = 1
    elif dict_label == "哀傷":
        label = 2
    else:
        label = -1
    test_labels.append(label)
    test_words.append(cut_words)
print(len(test_labels), len(test_words))   #97366 97366
print(test_labels[:5])

#-----------------------------------------------------------------------------
# TF-IDF computation
# Convert the texts into a term-frequency matrix: element a[i][j] is the frequency of word j in document i
vectorizer = CountVectorizer(min_df=100)   # min_df used to avoid MemoryError

# This class computes the tf-idf weight of every word
transformer = TfidfTransformer()

# The first fit_transform computes tf-idf, the second converts texts into the term-frequency matrix
tfidf = transformer.fit_transform(vectorizer.fit_transform(train_words + test_words))
for n in tfidf[:5]:
    print(n)
print(type(tfidf))

# Get all words of the bag-of-words model
word = vectorizer.get_feature_names()
for n in word[:10]:
    print(n)
print("單詞數(shù)量:", len(word))

# Extract the tf-idf matrix: element w[i][j] is the tf-idf weight of word j in document i
X = coo_matrix(tfidf, dtype=np.float32).toarray()   # sparse matrix to dense array
print(X.shape)
print(X[:10])

X_train = X[:len(train_labels)]
X_test = X[len(train_labels):]
y_train = train_labels
y_test = test_labels
print(len(X_train), len(X_test), len(y_train), len(y_test))

#-----------------------------------------------------------------------------
# Classification model
clf = MultinomialNB()
#clf = svm.LinearSVC()
#clf = LogisticRegression(solver='liblinear')
#clf = RandomForestClassifier(n_estimators=10)
#clf = neighbors.KNeighborsClassifier(n_neighbors=7)
#clf = AdaBoostClassifier()
clf.fit(X_train, y_train)
print('模型的準(zhǔn)確度:{}'.format(clf.score(X_test, y_test)))
pre = clf.predict(X_test)
print("分類")
print(len(pre), len(y_test))
print(classification_report(y_test, pre, digits=4))
```

The output is as follows:
```
訓(xùn)練集數(shù)目(總體):209043
測(cè)試集數(shù)目(總體):97366
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\xdtech\AppData\Local\Temp\jieba.cache
Loading model cost 0.885 seconds.
Prefix dict has been built succesfully.

<class 'scipy.sparse.csr.csr_matrix'>
單詞數(shù)量: 6997
(306409, 6997)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
209043 97366 209043 97366

模型的準(zhǔn)確度:0.6670398290984533
分類
97366 97366
             precision    recall  f1-score   support

          0     0.6666    0.9833    0.7945     61453
          1     0.6365    0.1184    0.1997     17461
          2     0.7071    0.1330    0.2240     18452

avg / total     0.6689    0.6670    0.5797     97366
```

IV. Weibo Sentiment Analysis with the BERT Model
The overall model framework is shown in the figure below:
1. Model Training
blog34_kerasbert_train.py
The code is as follows:
# -*- coding: utf-8 -*- """ Created on Wed Nov 24 00:09:48 2021 @author: xiuzhang """ import json import codecs import pandas as pd from keras_bert import load_trained_model_from_checkpoint, Tokenizer from keras.layers import * from keras.models import Model from keras.optimizers import Adamimport os import tensorflow as tf os.environ["CUDA_DEVICES_ORDER"] = "PCI_BUS_IS" os.environ["CUDA_VISIBLE_DEVICES"] = "0"#指定了每個(gè)GPU進(jìn)程中使用顯存的上限,0.9表示可以使用GPU 90%的資源進(jìn)行訓(xùn)練 gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.9) sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))maxlen = 300 BATCH_SIZE = 8 config_path = 'chinese_L-12_H-768_A-12/bert_config.json' checkpoint_path = 'chinese_L-12_H-768_A-12/bert_model.ckpt' dict_path = 'chinese_L-12_H-768_A-12/vocab.txt'#讀取vocab詞典 token_dict = {} with codecs.open(dict_path, 'r', 'utf-8') as reader:for line in reader:token = line.strip()token_dict[token] = len(token_dict)#------------------------------------------類函數(shù)定義-------------------------------------- #詞典中添加否則Unknown class OurTokenizer(Tokenizer):def _tokenize(self, text):R = []for c in text:if c in self._token_dict:R.append(c)else:R.append('[UNK]') #剩余的字符是[UNK]return R tokenizer = OurTokenizer(token_dict)#數(shù)據(jù)填充 def seq_padding(X, padding=0):L = [len(x) for x in X]ML = max(L)return np.array([np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X])class DataGenerator:def __init__(self, data, batch_size=BATCH_SIZE):self.data = dataself.batch_size = batch_sizeself.steps = len(self.data) // self.batch_sizeif len(self.data) % self.batch_size != 0:self.steps += 1def __len__(self):return self.stepsdef __iter__(self):while True:idxs = list(range(len(self.data)))np.random.shuffle(idxs)X1, X2, Y = [], [], []for i in idxs:d = self.data[i]text = d[0][:maxlen]x1, x2 = tokenizer.encode(first=text)y = d[1]X1.append(x1)X2.append(x2)Y.append(y)if len(X1) == self.batch_size or i == idxs[-1]:X1 = seq_padding(X1)X2 = seq_padding(X2)Y = seq_padding(Y)yield [X1, X2], Y[X1, X2, Y] = [], [], []#構(gòu)建模型 def create_cls_model(num_labels):bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)for layer in bert_model.layers:layer.trainable = Truex1_in = Input(shape=(None,))x2_in = Input(shape=(None,))x = bert_model([x1_in, x2_in])cls_layer = Lambda(lambda x: x[:, 0])(x) #取出[CLS]對(duì)應(yīng)的向量用來做分類p = Dense(num_labels, activation='softmax')(cls_layer) #多分類model = Model([x1_in, x2_in], p)model.compile(loss='categorical_crossentropy',optimizer=Adam(1e-5),metrics=['accuracy'])model.summary()return model#------------------------------------------主函數(shù)----------------------------------------- if __name__ == '__main__':#數(shù)據(jù)預(yù)處理train_df = pd.read_csv("data/weibo_3_moods_train.csv").fillna(value="")test_df = pd.read_csv("data/weibo_3_moods_test.csv").fillna(value="")print("begin data processing...")labels = train_df["label"].unique()print(labels)with open("label.json", "w", encoding="utf-8") as f:f.write(json.dumps(dict(zip(range(len(labels)), labels)), ensure_ascii=False, indent=2))train_data = []test_data = []for i in range(train_df.shape[0]):label, content = train_df.iloc[i, :]label_id = [0] * len(labels)for j, _ in enumerate(labels):if _ == label:label_id[j] = 1train_data.append((content, label_id))print(train_data[0])for i in range(test_df.shape[0]):label, content = test_df.iloc[i, :]label_id = [0] * len(labels)for j, _ in enumerate(labels):if _ == label:label_id[j] = 1test_data.append((content, 
label_id))print(len(train_data),len(test_data))print("finish data processing!\n")#模型訓(xùn)練model = create_cls_model(len(labels))train_D = DataGenerator(train_data)test_D = DataGenerator(test_data)print("begin model training...")print(len(train_D), len(test_D)) #26131 12171model.fit_generator(train_D.__iter__(),steps_per_epoch=len(train_D),epochs=10,validation_data=test_D.__iter__(),validation_steps=len(test_D))print("finish model training!")#模型保存model.save('cls_mood.h5')print("Model saved!")result = model.evaluate_generator(test_D.__iter__(), steps=len(test_D))print("模型評(píng)估結(jié)果:", result)模型的架構(gòu)如下圖所示,本文調(diào)用GPU實(shí)現(xiàn)。
The label list and train/test sizes printed during preprocessing are:
```
['哀傷' '喜悅' '憤怒']
209043 97366
```
The training results are as follows:
```
Epoch 1/3
15000/15000 [==============================] - 3561s 237ms/step - loss: 0.6973 - acc: 0.6974 - val_loss: 1.2818 - val_acc: 0.6068
Epoch 2/3
15000/15000 [==============================] - 3544s 236ms/step - loss: 0.5900 - acc: 0.7523 - val_loss: 1.5190 - val_acc: 0.6007
Epoch 3/3
15000/15000 [==============================] - 3545s 236ms/step - loss: 0.4615 - acc: 0.8137 - val_loss: 1.6390 - val_acc: 0.5981
finish model training!
Model saved!
```

As shown in the figure below, the trained model is saved as an h5 file of roughly 2 GB.
The final evaluation output is:
```
模型評(píng)估結(jié)果: [1.6390499637700617, 0.5981]
```
Problem: each step is slow and the whole training run took about 3 hours; GPU out-of-memory (OOM) errors can also appear:
```
If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
```
The reason is that GPU memory usage is too high and memory runs out even with the 90% cap; reducing batch_size helps, but I have not yet found a better solution. A possible mitigation is sketched below.
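One mitigation I would try (an untested assumption on my part, using the TensorFlow 1.x session API the training script already relies on) is to let TensorFlow allocate GPU memory on demand instead of reserving 90% up front, combined with a smaller BATCH_SIZE:

```python
import tensorflow as tf

# Grow GPU memory on demand rather than pre-allocating a fixed 90% share.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction lower than 0.9:
# config.gpu_options.per_process_gpu_memory_fraction = 0.7
sess = tf.Session(config=config)

# In the training script, also consider reducing the batch size, e.g.:
# BATCH_SIZE = 4
```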
After the data generators are built, the number of batches per epoch is printed:
- train_D = DataGenerator(train_data) → 26131
- test_D = DataGenerator(test_data) → 12171
Note that batch_size controls the number of batches: with BATCH_SIZE changed to 32, the same code prints 6533 and 3043, as the sketch below reproduces.
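These numbers follow directly from the ceiling division in DataGenerator; a small sketch mirroring that logic:

```python
# Mirror of DataGenerator.__len__: number of batches = ceil(n_samples / batch_size)
def steps_per_epoch(n_samples, batch_size):
    steps = n_samples // batch_size
    if n_samples % batch_size != 0:
        steps += 1
    return steps

print(steps_per_epoch(209043, 8), steps_per_epoch(97366, 8))    # 26131 12171
print(steps_per_epoch(209043, 32), steps_per_epoch(97366, 32))  # 6533 3043
```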
2. Model Evaluation
blog34_kerasbert_evaluate.py
# -*- coding: utf-8 -*- """ Created on Thu Nov 25 00:09:02 2021 @author: xiuzhang 引用:https://github.com/percent4/keras_bert_text_classification """ import json import numpy as np import pandas as pd from keras.models import load_model from keras_bert import get_custom_objects from sklearn.metrics import classification_report from blog34_kerasbert_train import token_dict, OurTokenizermaxlen = 300#加載訓(xùn)練好的模型 model = load_model("cls_mood.h5", custom_objects=get_custom_objects()) tokenizer = OurTokenizer(token_dict) with open("label.json", "r", encoding="utf-8") as f:label_dict = json.loads(f.read())#單句預(yù)測(cè) def predict_single_text(text):text = text[:maxlen]x1, x2 = tokenizer.encode(first=text) #BERT TokenizeX1 = x1 + [0] * (maxlen - len(x1)) if len(x1) < maxlen else x1X2 = x2 + [0] * (maxlen - len(x2)) if len(x2) < maxlen else x2#print(X1,X2)#模型預(yù)測(cè)predicted = model.predict([[X1], [X2]])y = np.argmax(predicted[0])return label_dict[str(y)]#模型評(píng)估 def evaluate():test_df = pd.read_csv("data/weibo_3_moods_test.csv").fillna(value="")true_y_list, pred_y_list = [], []for i in range(test_df.shape[0]):true_y, content = test_df.iloc[i, :]pred_y = predict_single_text(content)print("predict %d samples" % (i+1))print(true_y,pred_y)true_y_list.append(true_y)pred_y_list.append(pred_y)return classification_report(true_y_list, pred_y_list, digits=4)#------------------------------------模型評(píng)估--------------------------------- output_data = evaluate() print("model evaluate result:\n") print(output_data)輸出結(jié)果如下所示:
These prediction scores are frighteningly low, haha! That is probably related to how my dataset was labeled; on the bright side, the predictions are spread across classes rather than one class dominating everything. How well would the model transfer? Readers are encouraged to try it, especially on a higher-quality dataset.
The full evaluation report is as follows:
```
model evaluate result:

             precision    recall  f1-score   support

         哀傷     0.4162    0.4301    0.4230     18452
         喜悅     0.7957    0.4244    0.5535     61453
         憤怒     0.2629    0.6854    0.3800     17461

avg / total     0.6282    0.4723    0.4977     97366
```

3. Model Prediction
blog34_kerasbert_predict.py
# -*- coding: utf-8 -*- """ Created on Thu Nov 25 00:10:06 2021 @author: xiuzhang 引用:https://github.com/percent4/keras_bert_text_classification """ import time import json import numpy as npfrom blog34_kerasbert_train import token_dict, OurTokenizer from keras.models import load_model from keras_bert import get_custom_objectsmaxlen = 256 s_time = time.time()#加載訓(xùn)練好的模型 model = load_model("cls_mood.h5", custom_objects=get_custom_objects()) tokenizer = OurTokenizer(token_dict) with open("label.json", "r", encoding="utf-8") as f:label_dict = json.loads(f.read())#預(yù)測(cè)示例語句 text = "《長(zhǎng)津湖》這部電影真的非常好看,今天看完好開心,愛了愛了。強(qiáng)烈推薦大家,哈哈!!!" #text = "聽到這個(gè)消息真心難受,很傷心,怎么這么悲劇。保佑保佑,哭" #text = "憤怒,我真的挺生氣的,怒其不爭(zhēng),哀其不幸啊!"#Tokenize text = text[:maxlen] x1, x2 = tokenizer.encode(first=text) X1 = x1 + [0] * (maxlen-len(x1)) if len(x1) < maxlen else x1 X2 = x2 + [0] * (maxlen-len(x2)) if len(x2) < maxlen else x2#模型預(yù)測(cè) predicted = model.predict([[X1], [X2]]) y = np.argmax(predicted[0]) e_time = time.time() print("原文: %s" % text) print("預(yù)測(cè)標(biāo)簽: %s" % label_dict[str(y)]) print("Cost time:", e_time-s_time)輸出結(jié)果如下所示,可以看到準(zhǔn)確對(duì)三種類型的評(píng)價(jià)進(jìn)行預(yù)測(cè)。
```
原文: 《長(zhǎng)津湖》這部電影真的非常好看,今天看完好開心,愛了愛了。強(qiáng)烈推薦大家,哈哈!!!
預(yù)測(cè)標(簽: 喜悅

原文: 聽到這個(gè)消息真心難受,很傷心,怎么這么悲劇。保佑保佑,哭
預(yù)測(cè)標(biāo)簽: 哀傷

原文: 憤怒,我真的挺生氣的,怒其不爭(zhēng),哀其不幸啊!
預(yù)測(cè)標(biāo)簽: 憤怒
```

V. Summary
That wraps up this article. More posts will follow, including named entity recognition with BERT and the underlying theory. I sincerely hope this article has been helpful. Keep going!
- I. Introducing the BERT Model
- II. Introduction to the Dataset
- III. Weibo Sentiment Analysis with Machine Learning
- IV. Weibo Sentiment Analysis with the BERT Model
1. Model Training
2. Model Evaluation
3. Model Prediction
Download links:
- https://github.com/eastmountyxz/AI-for-Keras
- https://github.com/eastmountyxz/AI-for-TensorFlow
(By: Eastmount, 2021-12-06, written at night in Wuhan. http://blog.csdn.net/eastmount/)
References:
- [1] https://github.com/google-research/bert
- [2] https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
- [3] https://github.com/percent4/keras_bert_text_classification
- [4] https://github.com/huanghao128/bert_example
- [5] How to evaluate the BERT model? - Zhihu
- [6] [NLP] The Google BERT model explained in detail - 李rumor
- [7] NLP (35): Multi-class text classification with keras-bert - 山陰少年
- [8] A simple BERT implementation with TensorFlow 2 + Keras - 小黃
- [9] NLP must-read: understand the Google BERT model in ten minutes - 奇點(diǎn)機(jī)智
- [10] A detailed introduction to the BERT model - IT小佬
- [11] [Deep Learning] NLP: using BERT with Keras (part 1) - 天空是很藍(lán)
- [12] https://github.com/CyberZHG/keras-bert
- [13] https://github.com/bojone/bert4keras
- [14] https://github.com/ymcui/Chinese-BERT-wwm
- [15] Named entity recognition with deep learning (6): an introduction to BERT - 滌生
- [16] https://blog.csdn.net/qq_36949278/article/details/117637187